Hi. I have been struggling to get to the root of some performance problems on our pool of search heads, which consists of two beefy servers. We do NOT see this performance issue on our other, identical site; the only difference is that site's users and whatever searches they may run.
When I try a splunk restart, splunkweb always hangs and the python process ultimately has to be killed manually.
I have started using S.o.S (Splunk on Splunk) to try to figure this out.
It shows occasional splunkweb CPU spikes, but nothing that lasts or that explains the persistent slowness of our system. However, "top" shows splunkd as the culprit, so I'm unsure where to go from there.
Can anyone suggest how I might start narrowing this problem down?
I have already disabled any glaringly obvious user searches that would hose the system.
I would also check your dispatch directory. A large number of dirs/files can slow things down.
$SPLUNK_HOME/var/run/splunk/dispatch
or if in a pooled space
[Pooled Share]/var/run/splunk/dispatch
To get a count of files/dirs in each directory:
ls -l | wc -l
You might want to check for a large amount of files under the var dirs in general.
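If the count is high, a quick sketch like the following (assuming GNU find, and treating anything untouched for a couple of days as a stale job) shows how many dispatch directories haven't been modified recently:
find $SPLUNK_HOME/var/run/splunk/dispatch -maxdepth 1 -type d -mtime +2 | wc -l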
Here's also a search to calculate scheduled search lag and see whether the scheduler is falling behind. A lag of 30 seconds is probably normal, but you may want to investigate anything above that. You can set HIGH_WATERMARK to your liking as a reference point.
As a requirement, you will need to be indexing scheduler.log.
Replace the host names below with the host names of your search heads.
(host=hosta OR host=hostb) index=_internal source=*scheduler.log | eval JOB_DELAY_SECS=(dispatch_time-scheduled_time) | timechart span=5m perc95(JOB_DELAY_SECS) by host | eval HIGH_WATERMARK=100
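As a related check (this assumes your version's scheduler.log records a status field), you can also chart how many scheduled searches are being skipped outright:
(host=hosta OR host=hostb) index=_internal source=*scheduler.log status=skipped | timechart span=5m count by host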
If you are on Linux, you can run this command to see what splunkd or splunkweb is spending time on.
strace -p <splunk pid> -tt
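If the raw trace is too noisy, a summary run is often easier to read (a sketch; let it attach for a minute or so, then interrupt with Ctrl-C to get per-syscall totals):
strace -c -p <splunk pid>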
This turned out to be where the problem was. There were session and session.lock files going back over a year -- roughly two million of them. They were caused by over-monitoring of the systems and an apparent bug (from what I read) in this older version of Splunk that prevented the files from being cleaned up. Newer versions have this fixed.
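For anyone hitting the same thing, a rough sketch like this (assuming the session files live under $SPLUNK_HOME/var/run/splunk; the exact location may differ by version) counts session files older than 30 days:
find $SPLUNK_HOME/var/run/splunk -name 'session*' -mtime +30 | wc -l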
Other tools you can use to this effect include:
The S.o.S app can help you to track CPU usage at a per-process level for Splunk processes with the 'ps_sos.sh' scripted input. For more information, read this Splunk Answer.
The 'top' command scoped on the main splunkd process and split by thread.
top -H -p $(head -1 $SPLUNK_HOME/var/run/splunk/splunkd.pid)
If the CPU usage is associated with the main splunkd process, this would allow you to identify the thread ID that is mainly responsible for it. Using 'pstack', you might even be able to take a good guess at which component that is. The Tailing Processor thread is usually easy to identify, for example.
Check the number of threads your splunkd process is using by doing:
ps -Lef | grep $splunkpid | wc -l
then try:
pstack $splunkpid > $outputfilename
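To see which threads stay busy over time rather than in a single snapshot, a small sketch (assuming bash, and that $splunkpid is already set as above) takes a few samples spaced apart:
for i in 1 2 3; do pstack $splunkpid > pstack.$i.out; sleep 10; done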
See if any of that gets you on the right path.
I don't have any answer off the top of my head, so I'll just comment here: I once saw a system brought to its knees by a populating search for a summary index - it looked innocent enough, but the person who wrote it was unclear on the concept...