As per another topic on "answers" I executed the following search:
index=_internal source=metrics.log group=queue | timechart perc95(current_size) by name
This confirms that my parsingqueue is almost always at 1000, which would probably explain why one splunkd process is constantly using 100% of 1 of my 4 CPUs.
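To confirm the queue is actually blocking rather than just sitting full, a variation of the search above counts blocked events per queue. This is a sketch, assuming the standard blocked=true field on queue events in metrics.log:

index=_internal source=*metrics.log* group=queue blocked=true | timechart count by name

A queue that shows up here continuously is the one downstream of the bottleneck.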
I am also receiving the following sequence of errors every 300ms in splunkd.log; it might be a coincidence, or it might be the cause.
02-22-2011 19:08:59.772 ERROR TcpInputProc - Received unexpected 68021378 byte message! from hostname=txxxxxxxxxx, ip=10.xxxxxxxx, port=45384
02-22-2011 19:08:59.772 INFO TcpInputProc - Hostname=txxxxxxxxxxxx closed connection
02-22-2011 19:08:59.855 INFO TcpInputProc - Connection in cooked mode from txxxxxxxxxxxx
02-22-2011 19:08:59.913 INFO TcpInputProc - Valid signature found
02-22-2011 19:08:59.913 INFO TcpInputProc - Connection accepted from txxxxxxxxxxx
Is it possible that some input from a forwarder keeps getting reprocessed?
Any pointers truly welcome.
The TcpInputProc errors you are seeing indicate mangled or invalid input on a splunktcp input. It might not be Splunk at all, but some other program connecting to that socket. If so, you could quiesce the source program or firewall off the access.
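If you want to see which hosts are sending the oversized messages before blocking anything, you can search the internal logs for the error itself. This is a sketch, assuming the hostname= field in the TcpInputProc event is extracted automatically:

index=_internal source=*splunkd.log* TcpInputProc "Received unexpected" | stats count by hostname

A single host dominating that count points at one misbehaving sender rather than a general input problem.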
Alternatively, that might be a quite old 4.0.x/3.4.x forwarder doing bad things with heartbeats. If it is a Splunk forwarder, make sure it is running a relatively current version.
Splunk using 100% CPU is not so odd if it has work to do. If it is getting behind, then it may be useful to look at CPU time by processor in metrics to see where most of the time is being spent.
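The per-processor CPU breakdown can be pulled from the same metrics.log. A sketch using the pipeline group, which records cpu_seconds for each processor in the indexing pipeline:

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by processor

Whichever processor dominates the chart (e.g. aggregation or typing) tells you which stage of parsing is eating the CPU.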
Indexing can fall behind due to disk write speed bottlenecks or CPU exhaustion. I'd use system tools (top, iostat) to get an idea of which, then dig in further along those lines.
This probably becomes a support case, but you can get started if you want, with links like:
Your hints helped us identify the aggqueue and parsingqueue as the culprits. This answer from Gerald helped us fix it:
Glad to hear it is fixed! Sorry it is tricky to handle investigation cases in Splunk Answers.