Monitoring Splunk

Search Head severe performance issue

flle
Path Finder

Hi,

We are experiencing severe performance problems with a search head and have not been able to find the cause, so I hope to get a few more hints or ideas.
The problem shows as the SH being extremely slow to respond, or completely unresponsive.
It can take minutes for the search interface to fully load.
If Splunk is restarted on the SH, the problem is gone and performance is good again, but eventually it comes back, sometimes within minutes, sometimes only after several hours.

SH machine:
Virtual machine running Red Hat Linux 5.5 on an ESX host, 8 vCPUs, 32 GB RAM, (shared) GBit network interface.
Splunk version 5.0.2. The SH runs distributed searches against 2 indexers.

When the problem is present, the SH displays this message:

Your network connection may have been lost or Splunk web may be down.

Shortly afterwards, the following message appears, and then both of them stay or keep getting refreshed.

Your network connection was either restored or Splunk web was restarted.

It's definitely not a network problem; interface utilization is always below 1 MB/s.
The physical interface of the ESX host is also far from being fully utilized.

From a machine resources point of view things don't look bad (stats taken while the issue is present):

top - 16:06:00 up 102 days,  4:33,  1 user,  load average: 0.47, 1.09, 1.00
Tasks: 180 total,   1 running, 178 sleeping,   0 stopped,   1 zombie
Cpu(s): 23.7%us,  1.9%sy,  0.0%ni, 74.1%id,  0.1%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  32959872k total, 16477580k used, 16482292k free,  7452048k buffers
Swap:  7340000k total,    32968k used,  7307032k free,  6447520k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9815 splunk    19   0  210m 125m 9572 S 88.1  0.4   0:03.32 splunkd
28431 splunk    15   0 1358m 851m  15m S 23.9  2.6 345:40.15 splunkd
 8714 tzhlefl1  15   0 10864 1092  760 R  0.7  0.0   0:00.24 top
 2757 root      10  -5     0    0    0 S  0.3  0.0 212:19.67 kjournald
 8566 splunk    20   0 90480  24m 9188 S  0.3  0.1   0:01.05 splunkd
32401 splunk    19   0  106m  27m 9204 S  0.3  0.1   0:42.16 splunkd
...

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd  free     buff    cache     si   so    bi    bo   in   cs us sy id wa st
 1  0  32968 16253304 7452748 6464240    0    0     0   210    0    0 16  1 82  0  0
 2  0  32968 16127468 7452748 6464252    0    0     0   193 2091 1137 20  1 79  0  0
 4  0  32968 16130308 7452748 6464864    0    0     0   759 1746  709 21  1 78  0  0
 1  0  32968 16067828 7452764 6465200    0    0     0   487 1588  610 17  1 82  0  0
 1  0  32968 16005332 7452764 6465188    0    0     0   118 1453  528 15  0 85  0  0
 1  0  32968 16034304 7452768 6466060    0    0     0   198 1599  534 17  1 83  0  0
...

We have assigned dedicated CPU resources and RAM to the VM to make sure it is not an ESX resource allocation problem.

Checks and info on the Splunk side:

The SH has quite a lot of scheduled searches configured; concurrent searches average around 12, with peaks up to 20 (taken from metrics.log). Additionally, there are 4 real-time searches running.
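
For anyone wanting to reproduce those numbers, something along these lines should work. This is only a sketch; I am assuming the search_concurrency group is present in your version's metrics.log, and field names may differ between releases:

$ splunk search 'index=_internal source=*metrics.log group=search_concurrency "system total" | timechart span=10m max(active_hist_searches) AS concurrent_searches'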

In the logs I don't see any errors indicating performance issues, but there are a lot of "Search not executed" errors from the scheduler due to the concurrent search limit or users' disk quotas being exceeded.

/var/run/splunk/dispatch contains around 1000 directories during normal operations, with the oldest entries dating back 3 days. I suspect this could cause trouble, but it is still far from the default maximum. And running "splunk cmd splunkd clean-dispatch ... -1hours" does not help while the issue is present.
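
(For completeness: the full command needs a destination directory to move the old artifacts into; the path below is only a placeholder, roughly:)

$ splunk cmd splunkd clean-dispatch /tmp/dispatch-archive -1hours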

Inspecting job run times shows that jobs execute with OK performance, but it takes a long time to display the results. A search run from the GUI can have a run time of 1-2 seconds but take a minute to show the first results.
Running searches from the CLI returns results much faster.

The only hint at a possible issue from S.o.S. (Splunk on Splunk) is the high number of search artifacts in the dispatch directory. There also seems to be a memory leak in splunkd (memory usage rises slowly but consistently), but the occurrence of the problem does not correlate with that.

So I suspect some issue with result preparation and display, but I don't have any more clues on how to speed it up or where to troubleshoot/tune to get a grip on the issue.
The fact that it goes away when Splunk is restarted points towards an application issue, but I have no idea what causes it.

Any thoughts and hints are highly appreciated.

Regards
Flo

Update with answers to follow-up questions:

  • CPU reservation is used
  • The ESX server has 4 10-core CPUs with hyperthreading enabled, i.e. 40 physical cores / 80 logical processors
  • There is only CPU overcommitting (NO memory overcommitting)
  • The ESX host on which the SH runs has 27 VMs with altogether 100 vCPUs
  • CPU ready percentage for the SH VM is around 3%
  • Network metrics are OK. No errors, dropped packets, retransmits, overruns, etc.
  • VMware Tools are installed and running. Within the VM I checked everything that vmware-toolbox offers; all OK
  • From outside the VM (ESX management) I checked for resource issues (network, CPU, RAM). The only finding was peaks in CPU ready states; that's when we introduced the CPU reservation. It did not help
  • CPU ready is at around 3.5% for the SH VM
  • The SH is NOT acting as an indexer (not even for summaries)
  • The VMware infrastructure is OK. No alerts/warnings on the ESX side
  • Time is set via NTP, not VMware timekeeping (it is disabled)
  • There are some long-running searches, but they are well distributed across the day (or rather the night). As said, concurrent searches during the day are between 10 and 20 (taken from metrics.log)

I still suspect some weird splunkd/splunkweb behaviour when the issue kicks in, but inspecting the various internal SH logs has not yielded any helpful clues yet.

gfuente
Motivator

Hello

We are running a cluster with 2 peers, 1 master, and 1 SH, all physical servers (32 GB RAM and 24 CPU cores each), on Linux x64.

We found that it's related to the number of concurrent users (not concurrent searches). Restarting just the splunkweb process fixes the issue.
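
For reference, the web tier can be bounced on its own without restarting splunkd:

$ splunk restart splunkweb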

Also we found this:

The web.conf docs say this:

http://docs.splunk.com/Documentation/Splunk/5.0.2/Admin/Webconf

#
# cherrypy HTTP server config
#

server.thread_pool = <integer>
   * Specifies the minimum number of threads the appserver is allowed to maintain
   * Defaults to 50

server.thread_pool_max = <integer>
   * Specifies the maximum number of threads the appserver is allowed to maintain
   * Defaults to -1 (unlimited)

server.thread_pool_min_spare = <integer>
    * Specifies the minimum number of spare threads the appserver keeps idle
    * Defaults to 15

server.thread_pool_max_spare = <integer>
    * Specifies the maximum number of spare threads the appserver keeps idle
    * Defaults to 30 

But our default web.conf file has only 3 of these settings, and all of them are commented out. We checked with btool, and these settings are not being defined by any app.

So the documentation doesn't match the default Splunk configuration... maybe the issue is related to this.

Regards

gfuente
Motivator

Hello

It seems to be a known bug, fixed in 5.0.3. In the meantime we solved it by adding this to the web.conf file:

thread_pool_max_spare = 300
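
In case it helps, a minimal sketch of where that override lives, in $SPLUNK_HOME/etc/system/local/web.conf (check the web.conf spec for your version on whether the key needs the server. prefix):

[settings]
thread_pool_max_spare = 300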

This worked for us, but it may not fix your issue. I would recommend opening a case with support, as they helped us solve ours.

Regards

linu1988
Champion

Hello flle,
I saw that your CPU usage is not that high, and the searches seem fine with 20 instances of them running. One thing that may be happening is that the searches are not efficient and rely heavily on where clauses. The index size may be big, and searching across all of it will take a toll on I/O operations. We have 20-30 searches running without much of a problem. If possible, could you disable some of the searches and see whether performance goes up? If not, try converting some of the real-time searches to scheduled searches running every minute or two; that would not make much of a difference.

0 Karma

flle
Path Finder

Hello gfuente,

We have not solved the issue yet. We could reduce the impact by reducing the load (concurrent searches), but we did not find the root cause.
Did you solve it in the meantime?
Did the web config parameters yield any new insight?

Regards

0 Karma

Lucas_K
Motivator

Someone from Splunk can probably provide a much better description, but ... in terms of mounted vs. replicated bundles, it refers to how the search-time knowledge (props/transforms etc.) gets to the indexer when a search is requested by a search head.

It is provided either by a replicated bundle (a tarball of all the required files) or a mounted bundle (a location that the indexer can readily access at any time).

The difference is that in some circumstances replication can cause slowdowns when the bundle copy is slow (i.e. large lookups inside your app are copied to the indexers).
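
If you want to experiment with mounted bundles, the switch lives in distsearch.conf on each search peer; a rough sketch with placeholder values (the shared location must be reachable by the indexer):

# distsearch.conf on each indexer / search peer
[searchhead:<search head's serverName>]
mounted_bundles = true
bundles_location = /mnt/searchhead_bundles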

0 Karma

ShaneNewman
Motivator

We have the same problem on our Windows Server 2003 R2 boxes. Are you running the same OS?

0 Karma

gfuente
Motivator

Hello

We are facing exactly the same issue, but with physical servers. We are also running 5.0.2. Did you find the root of the problem? Was any solution provided?

thanks

0 Karma

flle
Path Finder

Updated post with additional info/answers to follow-up questions.

@Lucas_K: what do you mean by mounted or replicated bundles, and what would be the impact (or benefit) of each?

0 Karma

Lucas_K
Motivator

Are you using mounted or replicated bundles?

0 Karma

jspears
Communicator

Again, you have 8 vCPUs allocated. How many sockets and cores are available to satisfy that?

Have you tried lowering the number of vCPUs and measuring response times?

0 Karma

bmacias84
Champion

I know Splunk does not tolerate CPU ready time well. Do you know how to calculate the CPU ready percentage?

(CPU summation value / (chart default update interval in seconds * 1000)) * 100 = CPU ready %
Example: (1000 / (20 s * 1000)) * 100 = 5% CPU ready

The chart default update intervals are:

Realtime: 20 seconds
Past Day: 5 minutes (300 seconds)
Past Week: 30 minutes (1800 seconds)
Past Month: 2 hours (7200 seconds)
Past Year: 1 day (86400 seconds)
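
A quick sanity check of that arithmetic for a real-time (20 second) sample with a summation value of 1000 ms:

$ echo "scale=2; 1000 / (20 * 1000) * 100" | bc
5.00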

0 Karma

jspears
Communicator

Are you running the S.o.S. app? http://apps.splunk.com/app/748

Is RHEL set up according to VMware timekeeping guidelines? See their caveat about use of the divider=10 kernel parameter: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100642...
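
(You can check whether that parameter is set on the guest with something like:)

$ grep -o 'divider=[0-9]*' /proc/cmdline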

Does the job inspector show any long-running searches? I see in your 'top' output you have a splunkd process with 1358 MB of memory and a lot of CPU time under its belt.

0 Karma

jspears
Communicator

Is your VMware infrastructure up to par? How many real sockets and cores do you have available? Are your VMware tools installed, up to date, and running? Are there any interesting events or alerts on the VMware side?

0 Karma

bmacias84
Champion

Is that search head also acting as an indexer?

0 Karma

bmacias84
Champion

Question: are you using CPU reservation or affinity? Also, have you looked at your virtual network metrics and OS network stats (retransmits, dropped packets in/out, connections established/passive/active)? Which VM guest stats have you viewed?
