We are experiencing severe performance problems with a search head and have not been able to find a cause, so I am hoping for a few more hints or ideas.
The problem manifests as the SH being extremely slow to respond, or entirely unresponsive.
It can take minutes for the search interface to fully load.
If Splunk is restarted on the SH, the problem is gone and performance is good again. But eventually the problem comes back, sometimes within minutes, sometimes only after several hours.
Virtual machine: Red Hat Linux 5.5 on an ESX host, 8 vCPUs, 32 GB RAM, (shared) GBit network interface.
Splunk version 5.0.2. SH does distributed searches to 2 indexers.
When the problem is present, the SH displays the message:
Your network connection may have been lost or Splunk web may be down.
Shortly after that, the following message appears, and then both messages stay or keep getting refreshed:
Your network connection was either restored or Splunk web was restarted.
It's definitely not a network problem; interface utilization is always below 1 MB/s.
The physical interface of the ESX host is also far from saturated.
From a machine resources point of view things don't look bad (stats taken when issue is present):
top - 16:06:00 up 102 days, 4:33, 1 user, load average: 0.47, 1.09, 1.00
Tasks: 180 total, 1 running, 178 sleeping, 0 stopped, 1 zombie
Cpu(s): 23.7%us, 1.9%sy, 0.0%ni, 74.1%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32959872k total, 16477580k used, 16482292k free, 7452048k buffers
Swap: 7340000k total, 32968k used, 7307032k free, 6447520k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9815 splunk   19  0  210m 125m 9572 S 88.1  0.4   0:03.32 splunkd
28431 splunk   15  0 1358m 851m  15m S 23.9  2.6 345:40.15 splunkd
 8714 tzhlefl1 15  0 10864 1092  760 R  0.7  0.0   0:00.24 top
 2757 root     10 -5     0    0    0 S  0.3  0.0 212:19.67 kjournald
 8566 splunk   20  0 90480  24m 9188 S  0.3  0.1   0:01.05 splunkd
32401 splunk   19  0  106m  27m 9204 S  0.3  0.1   0:42.16 splunkd
...

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd     free     buff    cache   si so bi  bo   in   cs us sy id wa st
 1  0  32968 16253304  7452748  6464240    0  0  0 210    0    0 16  1 82  0  0
 2  0  32968 16127468  7452748  6464252    0  0  0 193 2091 1137 20  1 79  0  0
 4  0  32968 16130308  7452748  6464864    0  0  0 759 1746  709 21  1 78  0  0
 1  0  32968 16067828  7452764  6465200    0  0  0 487 1588  610 17  1 82  0  0
 1  0  32968 16005332  7452764  6465188    0  0  0 118 1453  528 15  0 85  0  0
 1  0  32968 16034304  7452768  6466060    0  0  0 198 1599  534 17  1 83  0  0
...
We have assigned dedicated CPU resources and RAM to the VM to rule out an ESX resource allocation problem.
Checks and info on the Splunk side:
The SH has quite a lot of scheduled searches configured; concurrent searches average around 12, with peaks up to 20 (taken from metrics.log). Additionally, there are 4 real-time searches running.
The logs don't contain any errors indicating performance issues, but there are a lot of "Search not executed" errors from the scheduler, caused by the concurrent-search limit or users' disk quotas being exceeded.
Inspecting job run times shows that jobs execute with acceptable performance, but it takes a long time to display the results. A search run from the GUI can have a Run Time of 1-2 seconds but take a minute to show the first results.
Running searches from the CLI returns results much faster.
The only hint from S.o.S. (Splunk on Splunk) at a possible issue is the high number of search artifacts in the dispatch directory. There also seems to be a memory leak in splunkd (memory usage rises slowly but consistently), but the problem's occurrence does not correlate with that.
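To put a number on that, a quick guest-side check of the dispatch directory is easy to script. This is a sketch: the path is the Splunk 5.x default, and `SPLUNK_HOME` may need adjusting for your install.

```shell
# Count search artifacts in the dispatch directory; each subdirectory
# is one search job's artifact. Path is the default for Splunk 5.x.
count_artifacts() {
    ls -1 "$1" 2>/dev/null | wc -l
}

DISPATCH="${SPLUNK_HOME:-/opt/splunk}/var/run/splunk/dispatch"
echo "$(count_artifacts "$DISPATCH") artifacts in $DISPATCH"
```

If the count runs into the thousands, expiring old jobs (via the job manager, or by lowering dispatch TTLs on the heaviest scheduled searches) is worth trying.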
So I suspect some issue with result preparation and display, but I don't have any more clues on how to speed it up, or where to troubleshoot and tune to get a grip on the issue.
The fact that it goes away when Splunk is restarted points towards an application issue, but I have no idea what causes it.
Any thoughts and hints are highly appreciated.
Updated post with answers to follow-up questions:
I still suspect some weird splunkd/splunk-web behaviour when the issue kicks in, but inspecting the various internal SH logs has not yielded any helpful clues yet.
We are running a cluster with 2 peers, 1 master, and 1 SH, all physical servers (32 GB RAM and 24 CPU cores each), on Linux x64.
We found that it's related to the number of concurrent users (not concurrent searches). Restarting just the splunkweb process fixes the issue.
We also found this in the web.conf docs:
#
# cherrypy HTTP server config
#
server.thread_pool = <integer>
* Specifies the minimum number of threads the appserver is allowed to maintain
* Defaults to 50
server.thread_pool_max = <integer>
* Specifies the maximum number of threads the appserver is allowed to maintain
* Defaults to -1 (unlimited)
server.thread_pool_min_spare = <integer>
* Specifies the minimum number of spare threads the appserver keeps idle
* Defaults to 15
server.thread_pool_max_spare = <integer>
* Specifies the maximum number of spare threads the appserver keeps idle
* Defaults to 30
But our default web.conf file has only 3 of these settings, and all of them are commented out. We checked with btool, and these settings are not being defined by any app.
So the documentation doesn't match the default Splunk configuration... maybe the issue is related to this.
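For anyone wanting to repeat that check: `btool` shows the effective settings and the file each one comes from, and a plain grep over the config tree is a rough stand-in when you just want to see what is written on disk. A sketch; `SPLUNK_HOME` and the setting-name fragment are assumptions to adjust.

```shell
# Authoritative check (requires a Splunk install):
#   $SPLUNK_HOME/bin/splunk btool web list --debug | grep thread_pool
#
# Rough grep-based approximation over the config tree:
find_setting() {
    # $1 = setting-name fragment, $2 = directory to scan
    grep -rIn "$1" "$2" 2>/dev/null
}

find_setting "thread_pool" "${SPLUNK_HOME:-/opt/splunk}/etc"
```

The grep version also catches settings in disabled apps that btool would ignore, so prefer btool's output when the two disagree.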
It seems to be a known bug, fixed in 5.0.3. In the meantime we worked around it by adding this to the web.conf file:
thread_pool_max_spare = 300
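For context, web.conf settings live in a [settings] stanza, so the full workaround in $SPLUNK_HOME/etc/system/local/web.conf would look roughly like this (note that the docs excerpt above spells the keys with a server. prefix, so it is worth checking with btool which spelling your version honours):

```
[settings]
# workaround for the CherryPy thread-pool bug (fixed in 5.0.3)
thread_pool_max_spare = 300
```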
This worked for us, but it may not work for your issue. I would recommend opening a case with support, as they helped us solve ours.
I saw that your CPU usage is not that high, and the searches seem fine with 20 instances running. One possibility is that the searches are inefficient and rely heavily on where clauses. The index may also be large, so searching across all of it will take a toll on I/O. We have 20-30 searches running without much problem. If possible, could you disable some of the searches and see whether performance goes up? If not, try configuring some of the real-time searches to run on a schedule every minute or two instead; that would not make much of a difference to the results.
We have not solved the issue yet. We reduced the impact by lowering the load (fewer concurrent searches), but did not find the root cause.
Did you solve it by now?
Did the web config parameters yield any new insight?
Someone from Splunk can probably provide a much better description, but... in terms of mounted vs. replicated bundles, it refers to how the knowledge objects (props/transforms etc.) get to the indexer when a search is requested by a search head.
They are provided either by a replicated bundle (a tarball of all the required files) or a mounted bundle (a location the indexer can readily access at any time).
The difference is that in some circumstances replication can cause slowdowns (e.g. large lookups inside your app being copied to the indexers).
I know Splunk does not tolerate CPU ready time well. Do you know how to calculate the CPU ready percentage?
(CPU ready summation value in ms / (chart default update interval in s * 1000)) * 100 = CPU ready %
Example for the realtime chart: (1000 / (20 * 1000)) * 100 = 5% CPU ready
The chart default update intervals are:
Realtime: 20 seconds
Past Day: 5 minutes (300 seconds)
Past Week: 30 minutes (1800 seconds)
Past Month: 2 hours (7200 seconds)
Past Year: 1 day (86400 seconds)
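The arithmetic above can be wrapped in a tiny helper for checking values copied out of the vSphere performance charts. A sketch; the summation value and interval are whatever your chart reports.

```shell
# CPU ready %: the chart's CPU ready summation value (in ms), divided by
# the chart update interval converted to ms, times 100.
cpu_ready_pct() {
    # $1 = CPU ready summation in ms, $2 = chart update interval in s
    awk -v sum="$1" -v interval="$2" \
        'BEGIN { printf "%.1f\n", sum / (interval * 1000) * 100 }'
}

cpu_ready_pct 1000 20    # realtime chart (20 s interval): prints 5.0
```

As a rule of thumb, sustained CPU ready above a few percent per vCPU is where VM performance complaints usually start.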
Are you running the S.o.S. app? http://apps.splunk.com/app/748
Is RHEL set up according to VMware's timekeeping guidelines? See their caveat about the divider=10 kernel parameter: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100642...
Does the job inspector show any long-running searches? I see in your 'top' output a splunkd process with 1358M of memory and a lot of CPU time under its belt.
Is your VMware infrastructure up to par? How many real sockets and cores do you have available? Are your VMware tools installed, up to date, and running? Are there any interesting events or alerts on the VMware side?
A question: are you using CPU reservation or affinity? Also, have you looked at your virtual network metrics and OS network stats (retransmits, dropped packets in/out, connections established/passive/active)? Which VM guest stats have you viewed?
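On the guest side, the standard Linux counters cover most of those questions. A sketch; the paths assume a typical Linux guest, and netstat availability depends on the distro.

```shell
# TCP counters, including RetransSegs (cumulative since boot)
awk '/^Tcp:/ { print }' /proc/net/snmp

# Per-interface packet and drop counters
cat /proc/net/dev

# Human-readable summary of retransmits and connection opens, if netstat exists
netstat -s 2>/dev/null | grep -iE 'retrans|active|passive' || true
```

Sampling these twice a few minutes apart and diffing the counters shows whether drops/retransmits actually climb while the SH is unresponsive.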