Solved: Troubleshooting high Search Head CPU

Sqig · ‎03-20-2013

Hi. I have been struggling to get to the root of some performance problems on our pool of search heads, which are two beefy servers. We do NOT see this performance issue on our other, identical site. The only difference is the users of the site and any searches they may run.

When I try a splunk restart, splunkweb always hangs and the python process ultimately has to be killed manually.

I have started using SoS to try to help figure this out.

It shows occasional Splunkweb CPU spikes but nothing that lasts and explains the persistent slowness of our system. However, "top" shows Splunkd as the culprit, so I'm unsure where to go from there.

Can anyone suggest how I might start narrowing this problem down?

I have already disabled any glaringly obvious user searches that would hose the system.

muhammad_luthfi · ‎01-26-2025

I faced this issue too, and it is now solved.

First, you need to see what is actually running: go to the Monitoring Console on the master.

Run the search below to find the search query/name.

| rest /services/search/jobs 
| search isRealTimeSearch=1 
| table sid, dispatchState, runDuration, search, eventCount, resultCount, title, provenance, label

From that I found the search that was causing the high CPU.

Go to the job console at the top right, and stop or delete the job.

Hope this helps.

bandit · ‎03-20-2013

I would also check your dispatch directory. A large number of dirs/files can slow things down.
$SPLUNK_HOME/var/run/splunk/dispatch
or if in a pooled space
[Pooled Share]/var/run/splunk/dispatch

To get a count of files/dirs in each directory (using 'ls -1' rather than 'ls -l', so the "total" line is not counted):

ls -1 | wc -l

You might want to check for a large number of files under the var dirs in general.
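To check the var dirs in general, the count can be looped over every subdirectory. This is only a sketch: the function name count_entries is made up for illustration, and it assumes a find that supports -mindepth (GNU and BSD find both do).

```shell
# count_entries DIR
# Print "<count> <subdir>" for each immediate subdirectory of DIR,
# largest first. The count covers every file and directory below it.
count_entries() {
    dir=$1
    for sub in "$dir"/*/; do
        [ -d "$sub" ] || continue
        # $(( ... )) strips the padding some wc implementations add
        n=$(( $(find "$sub" -mindepth 1 | wc -l) ))
        printf '%s %s\n' "$n" "${sub%/}"
    done | sort -rn
}
```

For example, count_entries "$SPLUNK_HOME/var/run/splunk" would list dispatch and its sibling directories with the fullest one at the top.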

Here is also a search that calculates scheduled search lag, to see whether the scheduler is falling behind. A lag of around 30 seconds is probably normal, but anything above that may be worth investigating. You can set HIGH_WATERMARK to whatever value you like as a reference point.

This requires that scheduler.log is being indexed.

Replace the host names below with the host names of your search heads.

(host=hosta OR host=hostb) index=_internal source=*scheduler.log | eval JOB_DELAY_SECS=(dispatch_time-scheduled_time) | timechart span=5m perc95(JOB_DELAY_SECS) by host | eval HIGH_WATERMARK=100
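If scheduler.log is not being indexed, roughly the same delay can be pulled straight from the file. This is a sketch under assumptions: it presumes each log line carries scheduled_time=<epoch> and dispatch_time=<epoch> as key=value pairs (the two fields the search above relies on), which may not hold for every version, and the function name job_delays is made up.

```shell
# job_delays FILE [THRESHOLD_SECS]
# Print "<delay_secs> <line_number>" for every line of FILE whose
# dispatch_time - scheduled_time exceeds the threshold (default 30).
job_delays() {
    awk -v limit="${2:-30}" '
    {
        sched = ""; disp = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/[,"]/, "", f)          # drop separators and quotes
            if (f ~ /^scheduled_time=/)     sched = substr(f, 16)
            else if (f ~ /^dispatch_time=/) disp  = substr(f, 15)
        }
        if (sched != "" && disp != "" && disp - sched > limit)
            print disp - sched, NR
    }' "$1"
}
```

Lines without both fields are skipped, so it can be pointed at the raw log without filtering first.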

If you are on Linux, you can run this command to see what splunkd or splunkweb is spending time on.

strace -p <splunk pid> -tt

Sqig · ‎06-26-2013

This turned out to be where the problem was. There were session and session.lock files going back for over a year -- roughly 2 million. They were caused by over-monitoring of the systems and an apparent bug (from what I read) in this older version of Splunk that kept the files from being cleaned up. Newer versions have this fixed.
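For anyone wanting to check for the same kind of buildup, here is a small sketch that only counts stale files and deletes nothing. The function name count_old_files and the default 'session*' name pattern are assumptions for illustration; point it at whichever directory you suspect.

```shell
# count_old_files DIR DAYS [PATTERN]
# Count regular files under DIR whose name matches PATTERN
# (default: session*) and that were last modified more than DAYS days ago.
count_old_files() {
    # $(( ... )) strips the padding some wc implementations add
    echo $(( $(find "$1" -type f -name "${3:-session*}" -mtime +"$2" | wc -l) ))
}
```

For example, count_old_files "$SPLUNK_HOME/var/run/splunk" 365 reports how many matching files are more than a year old.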

hexx · ‎03-20-2013

Other tools you can use to this effect include:

The S.o.S app can help you to track CPU usage at a per-process level for Splunk processes with the 'ps_sos.sh' scripted input. For more information, read this Splunk Answer.
The 'top' command scoped on the main splunkd process and split by thread:
top -H -p $(head -1 $SPLUNK_HOME/var/run/splunk/splunkd.pid)
If the CPU usage is associated with the main splunkd process, this would allow you to identify the thread ID that is mainly responsible for it. Using 'pstack', you might even be able to take a good guess at which component that is. The Tailing Processor thread is usually easy to identify, for example.
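As a complement to 'top -H', per-thread CPU totals can also be read straight from /proc on Linux. This is a rough sketch, assuming a kernel that exposes /proc/<pid>/task/<tid>/stat and comm. The function name thread_cpu_ticks is made up, and the figures are cumulative clock ticks since each thread started, not a live percentage.

```shell
# thread_cpu_ticks PID
# Print "<utime+stime ticks> <tid> <thread name>" for every thread of PID,
# busiest first.
thread_cpu_ticks() {
    for t in /proc/"$1"/task/*; do
        [ -r "$t/stat" ] || continue
        # Strip "pid (comm) " so a name containing spaces cannot shift the
        # fields; utime and stime are then the 12th and 13th fields.
        ticks=$(sed 's/.*) //' "$t/stat" | awk '{ print $12 + $13 }')
        printf '%s %s %s\n' "$ticks" "${t##*/}" "$(cat "$t/comm" 2>/dev/null)"
    done | sort -rn
}
```

For example, thread_cpu_ticks "$(head -1 $SPLUNK_HOME/var/run/splunk/splunkd.pid)" | head shows the threads that have used the most CPU so far; the thread IDs can then be matched against the 'pstack' output.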

jtrucks · ‎03-20-2013

Check the number of threads your splunkd process is using by doing:

ps -o nlwp= -p $splunkpid

then try:

pstack $splunkpid > $outputfilename

See if any of that gets you on the right path.

--
Jesse Trucks
Minister of Magic

lguinn2 · ‎03-20-2013

I don't have any answer off the top of my head, so I'll just comment here: I once saw a system brought to its knees by a populating search for a summary index - it looked innocent enough, but the person who wrote it was unclear on the concept...
