As per another topic on "answers" I executed the following search:
index=_internal source=metrics.log group=queue | timechart perc95(current_size) by name
This confirms that my parsingqueue is almost always at 1000, which would probably explain why one splunkd process is constantly using 100% of 1 of my 4 CPUs.
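To confirm the queue is actually blocking rather than just sitting full, a variation of the search above counts blocked events per queue. This is a sketch, assuming the standard blocked=true field on queue events in metrics.log:

index=_internal source=*metrics.log* group=queue blocked=true | timechart count by name

A queue that shows up here continuously is the one downstream of the bottleneck.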
I am also receiving the following sequence of errors every 300ms in splunkd.log; it might be a coincidence, or it might be the cause.
02-22-2011 19:08:59.772 ERROR TcpInputProc - Received unexpected 68021378 byte message! from hostname=txxxxxxxxxx, ip=10.xxxxxxxx, port=45384
02-22-2011 19:08:59.772 INFO TcpInputProc - Hostname=txxxxxxxxxxxx closed connection
02-22-2011 19:08:59.855 INFO TcpInputProc - Connection in cooked mode from txxxxxxxxxxxx
02-22-2011 19:08:59.913 INFO TcpInputProc - Valid signature found
02-22-2011 19:08:59.913 INFO TcpInputProc - Connection accepted from txxxxxxxxxxx
Is it possible that some input from a forwarder keeps getting reprocessed?
Any pointers truly welcome.
The TcpInputProc errors you are seeing indicate mangled or invalid input on a splunktcp input. It might not be Splunk at all, but some other program connecting to that socket. If so, you could quiesce the source program or firewall off the access.
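If you want to see which hosts are sending the oversized messages before blocking anything, you can search the internal logs for the error itself. This is a sketch, assuming the hostname= field in the TcpInputProc event is extracted automatically:

index=_internal source=*splunkd.log* TcpInputProc "Received unexpected" | stats count by hostname

A single host dominating that count points at one misbehaving sender rather than a general input problem.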
Alternatively, that might be a quite old 4.0.x/3.4.x forwarder doing bad things with heartbeats. If it is a Splunk forwarder, make sure it is running a relatively current version.
Splunk using 100% CPU is not so odd if it has work to do. If it is getting behind, then it may be useful to look at CPU time by processor in metrics to see where most of the time is being spent.
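The per-processor CPU breakdown can be pulled from the same metrics.log. A sketch using the pipeline group, which records cpu_seconds for each processor in the indexing pipeline:

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by processor

Whichever processor dominates the chart (e.g. aggregation or typing) tells you which stage of parsing is eating the CPU.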
Indexing can fall behind due to disk write speed bottlenecks or CPU exhaustion. I'd use system tools (top, iostat) to get an idea of which, then dig in further along those lines.
This probably becomes a support case, but you can get started if you want, with links like:
Your hints helped us identify the aggqueue and parsingqueue as the culprits. This answer from Gerald helped us fix it:
Glad to hear it is fixed! Sorry it is tricky to handle investigation cases in Splunk Answers.