Hi,
we are experiencing severe performance problems with a search head and could not really find a cause for this. So I hope to get a few more hints or ideas.
The problem manifests as the SH being extremely slow to respond or not responding at all.
It can take minutes for the search interface to fully load.
If Splunk is restarted on the SH, the problem is gone and performance is good again. But eventually the problem comes back, sometimes within minutes, sometimes only after several hours.
SH machine:
Virtual Linux Red Hat 5.5 on ESX Host, 8 vCPUs, 32 GB RAM, (shared) GBit network interface
Splunk version 5.0.2. SH does distributed searches to 2 indexers.
When the problem is present, the SH displays these messages:
Your network connection may have been lost or Splunk web may be down.
Shortly after that, the following message appears, and then both of them persist or keep getting refreshed.
Your network connection was either restored or Splunk web was restarted.
It's definitely not a network problem; interface utilization is always below 1 MB/s.
The physical interface of the ESX host is also far from saturated.
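(Checked with standard tools, along the lines of:)
$ sar -n DEV 5 5     # per-interface throughput, sampled every 5 seconds (sysstat)
$ netstat -i         # per-interface error/drop counters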
From a machine-resources point of view, things don't look bad (stats taken while the issue is present):
top - 16:06:00 up 102 days, 4:33, 1 user, load average: 0.47, 1.09, 1.00
Tasks: 180 total, 1 running, 178 sleeping, 0 stopped, 1 zombie
Cpu(s): 23.7%us, 1.9%sy, 0.0%ni, 74.1%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32959872k total, 16477580k used, 16482292k free, 7452048k buffers
Swap: 7340000k total, 32968k used, 7307032k free, 6447520k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9815 splunk 19 0 210m 125m 9572 S 88.1 0.4 0:03.32 splunkd
28431 splunk 15 0 1358m 851m 15m S 23.9 2.6 345:40.15 splunkd
8714 tzhlefl1 15 0 10864 1092 760 R 0.7 0.0 0:00.24 top
2757 root 10 -5 0 0 0 S 0.3 0.0 212:19.67 kjournald
8566 splunk 20 0 90480 24m 9188 S 0.3 0.1 0:01.05 splunkd
32401 splunk 19 0 106m 27m 9204 S 0.3 0.1 0:42.16 splunkd
...
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 32968 16253304 7452748 6464240 0 0 0 210 0 0 16 1 82 0 0
2 0 32968 16127468 7452748 6464252 0 0 0 193 2091 1137 20 1 79 0 0
4 0 32968 16130308 7452748 6464864 0 0 0 759 1746 709 21 1 78 0 0
1 0 32968 16067828 7452764 6465200 0 0 0 487 1588 610 17 1 82 0 0
1 0 32968 16005332 7452764 6465188 0 0 0 118 1453 528 15 0 85 0 0
1 0 32968 16034304 7452768 6466060 0 0 0 198 1599 534 17 1 83 0 0
...
We have assigned dedicated CPU resources and RAM to the VM to make sure it is not an ESX resource allocation problem.
Checks and info on the Splunk side:
The SH has quite a lot of scheduled searches configured; concurrent searches average around 12, with peaks up to 20 (taken from metrics.log). Additionally, there are 4 real-time searches running.
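(Roughly the search I pulled those numbers with; I'm quoting the search_concurrency group and its field names from memory, so they may differ slightly in this version's metrics.log:)
index=_internal source=*metrics.log* group=search_concurrency "system total" | timechart span=10m max(active_hist_searches) AS historical max(active_realtime_searches) AS realtime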
In the logs I don't see any errors indicating performance issues, but there are a lot of "Search not executed" errors from the scheduler due to exceeding the concurrent search limit or users' disk quotas.
/var/run/splunk/dispatch contains around 1000 directories during normal operation, with the oldest entries dating back 3 days. I suspect this could cause trouble, but it is still far from the default maximum, and running a "splunk cmd splunkd clean-dispatch ... -1hours" does not help while the issue is present.
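(Quick way to count and age the artifacts, assuming a standard install path under $SPLUNK_HOME:)
$ ls -1 $SPLUNK_HOME/var/run/splunk/dispatch | wc -l      # number of search artifacts
$ ls -lt $SPLUNK_HOME/var/run/splunk/dispatch | tail -5   # oldest entries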
Inspecting job run times shows that jobs execute with OK performance, but it takes a long time to display the results. A search run from the GUI can have a run time of 1-2 seconds but take a minute to show the first results.
Running searches from the CLI returns results much faster.
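(Example of the comparison from the CLI; path, search and credentials are placeholders:)
$ $SPLUNK_HOME/bin/splunk search 'index=_internal earliest=-15m | head 100' -auth admin:changeme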
The only hint from S.o.S. (Splunk on Splunk) at a possible issue is the high number of search artifacts in the dispatch directory. There also seems to be a memory leak in splunkd (memory usage rises slowly but consistently), but the occurrence of the problem does not correlate with that.
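(The raw memory numbers can also be watched outside S.o.S. with a crude loop against the main splunkd process, PID 28431 in the top output above:)
$ while true; do date; ps -o rss=,vsz= -p 28431; sleep 300; done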
So I suspect some issue with result preparation and display, but I don't have any more clues about how to speed it up or where to troubleshoot/tune to get a grip on the issue.
The fact that it goes away when Splunk is restarted points towards an application issue, but I have no idea what causes it.
Any thoughts and hints are highly appreciated.
Regards
Flo
Update with answers to follow-up questions:
CPU reservation is used
The ESX server has 4 10-core CPUs with hyperthreading enabled, so: 40 physical cores / 80 logical processors
There is only CPU overcommitting (NO memory overcommitting)
The ESX host on which the SH runs has 27 VMs with altogether 100 vCPUs.
CPU ready percentage for SH VM is around 3%
Network metrics are OK: no errors, dropped packets, retransmits, overruns, etc.
VMware Tools are installed and running. Within the VM I checked everything that vmware-toolbox offers; all OK.
From outside the VM (ESX management) I checked for resource issues (network, CPU, RAM). The only finding was peaks in CPU ready states; that's when we introduced the CPU reservation. It did not help.
CPU ready is at around 3.5% for the SH VM
The SH is NOT acting as an indexer (not even for summaries).
The VMware infrastructure is OK; no alerts/warnings on the ESX side.
Time is set via NTP, not VMware timekeeping (it is disabled).
There are some long-running searches, but they are well distributed across the day (or rather night). As said, search concurrency during the day is between 10 and 20 (taken from metrics.log).
I still suspect some weird splunkd/splunkweb behaviour when the issue kicks in, but inspecting the various internal SH logs has not yielded any helpful clues yet.
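(Next on my list is checking whether the splunkweb Python process, which runs separately from splunkd in 5.x, is pegged or stalled when the issue occurs, e.g. with a quick snapshot like:)
$ top -b -n 1 | egrep 'splunkd|python'
$ tail -f $SPLUNK_HOME/var/log/splunk/web_service.log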