I am seeing an issue on our Splunk server where we seem to be hitting a performance bottleneck. When generating charts from searches, it takes a long time to render the results. This is especially noticeable on dashboards where multiple graphs may be rendered. As our server is limited in resources, we have restricted concurrent searches to 5 (we are patient enough to wait for the results), yet the searches themselves appear to complete relatively quickly: the jobs report indicates that 98% of these searches take between 0 and 3 seconds to complete.
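For context, concurrent-search limits of this kind are normally set in limits.conf. A sketch of the relevant stanza follows; the values shown are examples only, and exact attribute names and defaults vary between Splunk versions, so check the limits.conf spec for your release:

```
# $SPLUNK_HOME/etc/system/local/limits.conf  (sketch; values are examples)
[search]
# Historical search concurrency is derived roughly as:
#   max_searches_per_cpu * number_of_cpus + base_max_searches
base_max_searches = 1
max_searches_per_cpu = 2
```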
When loading a dashboard, multiple splunkd processes can be seen being spawned, which I assume are carrying out the concurrent searches. These splunkd processes disappear at times that seem to correspond to the completion times shown in the job report. Around the same time, the main splunkd process's CPU usage rises as high as 300% and generates a load average of up to 16. This increased CPU usage and corresponding load average remain high until all the panels within the dashboard have been rendered, and this rendering can take a long time. I would be prepared to accept that this delay is reasonable when a search is long running and retrieves a lot of data, but the lag seems excessive for searches that only take between 0 and 3 seconds.
So, it would appear that we are hitting a bottleneck at the point at which the panels, or the data used to generate the panels, are rendered, perhaps queuing up or backing up behind a single-threaded process. Has anyone else observed this behaviour, and is there any way to improve this performance other than increasing the server's CPU?
Server specifications are:
CPU: 2 x 3 GHz (hyper-threaded) processors, cache size: 512 KB
Memory: 5 GB RAM
Storage: 1 TB SAN
Example of the difference between normal and high load:
Normal load
===========

top - 20:02:26 up 90 days, 10:25, 12 users,  load average: 0.50, 1.04, 0.94
Tasks: 208 total,   1 running, 207 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.5% us,  2.2% sy,  0.0% ni, 90.1% id,  0.2% wa,  0.0% hi,  0.0% si
Mem:   4933824k total,  4896456k used,    37368k free,    27384k buffers
Swap:  4192956k total,    10372k used,  4182584k free,  3867576k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
17878 syslogng  15   0  8776 7360  940 S   27  0.1   3508:47  syslog-ng
31911 root      16   0  698m 609m 9.9m S   10 12.7 260:36.48  splunkd

High load
=========

top - 19:55:07 up 90 days, 10:17, 12 users,  load average: 4.36, 1.92, 0.97
Tasks: 217 total,   1 running, 216 sleeping,   0 stopped,   0 zombie
Cpu(s): 66.7% us,  7.2% sy,  0.0% ni, 13.4% id, 12.5% wa,  0.1% hi,  0.0% si
Mem:   4933824k total,  4900944k used,    32880k free,    23892k buffers
Swap:  4192956k total,    10372k used,  4182584k free,  3842728k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
24262 root      19   0 40900  15m 8196 S  127  0.3   0:21.68  splunkd
24131 root      18   0 56096  25m 8176 S   73  0.5   0:13.31  splunkd
31911 root      15   0  698m 609m 9.9m S   58 12.7 259:23.29  splunkd
17878 syslogng  15   0  8776 7360  940 S   27  0.1   3507:15  syslog-ng
This feels more support-y (or consulting-y) than splunk-answers-y. Support may be able to help clarify whether there is something glaringly wrong (especially if you have a clearly answerable question to work on). They're not very good at open-ended performance goals, though we do have a consulting team who can work on such things.
Generally speaking, Splunk searches are phased: a period of I/O, possibly mixed with high CPU, followed by a period of usually high CPU, sometimes mixed with memory-bandwidth limitations. It might be instructive to look at how your load changes over time to try to evaluate what's being exhausted.
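One crude way to watch that over time is to snapshot top in batch mode every few seconds and total the CPU per command name. A minimal sketch, here run against a captured process table (the figures are the ones from the high-load output above); in practice you'd feed it live output such as `top -b -n 1 | tail -n +8`:

```shell
#!/bin/sh
# Total %CPU per command from a top-style process table.
# Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
awk '
    { cpu[$12] += $9 }                       # sum field 9 (%CPU) per command
    END { for (cmd in cpu) printf "%s %.1f\n", cmd, cpu[cmd] }
' <<'EOF'
24262 root     19 0 40900  15m 8196 S 127  0.3   0:21.68 splunkd
24131 root     18 0 56096  25m 8176 S  73  0.5   0:13.31 splunkd
31911 root     15 0  698m 609m 9.9m S  58 12.7 259:23.29 splunkd
17878 syslogng 15 0  8776 7360  940 S  27  0.1  3507:15 syslog-ng
EOF
```

Repeating this in a loop (e.g. `while true; do top -b -n 1 | ...; sleep 5; done`) gives a rough time series of which processes are burning the CPU during rendering.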
The job inspector interface (it's not trivial to map from a dashboard to a search job, sorry) can provide numbers on where a job spent its wall-clock time, which can help identify the bottleneck as well.
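The same job detail is also reachable over the management port via the REST API, which can be handier than clicking through the UI. A hedged sketch; the host, port, credentials, and search ID are placeholders, and the elided SID would come from the job listing:

```
# List recent search jobs, then fetch the detail for one of them.
# Host, port, credentials and <sid> below are placeholders.
curl -k -u admin:changeme https://localhost:8089/services/search/jobs
curl -k -u admin:changeme "https://localhost:8089/services/search/jobs/<sid>"
```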
Macroscopically, if you have several charts which all essentially present information about the same search concept, it's possible to rework your charts to all use a single search, which obviously reaps large benefits. There should be some examples of achieving this, but I haven't grokked its secrets.
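As an illustration of the idea only, a sketch of the base-search/post-process pattern in dashboard XML; the element names and whether your Splunk version supports this form vary across releases, and the index, sourcetype, and field names are made up:

```
<!-- Sketch: one base search feeding two panels via post-processing.
     Element names differ between Splunk dashboard-XML versions. -->
<dashboard>
  <search id="base">
    <query>index=web sourcetype=access_combined | stats count by host, status</query>
  </search>
  <row>
    <chart>
      <search base="base">
        <query>stats sum(count) by host</query>
      </search>
    </chart>
    <chart>
      <search base="base">
        <query>stats sum(count) by status</query>
      </search>
    </chart>
  </row>
</dashboard>
```

The expensive scan of raw events runs once; each panel then only post-processes the already-computed statistics.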
Search performance can also be tuned by looking at the specific searches and evaluating whether there's a faster way to do them, whether they should really be precalculated somehow, etc.
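Precalculation usually means a scheduled saved search writing to a summary index, which the dashboard then reads instead of the raw data. A sketch of the sort of stanza involved; the stanza name, search, and schedule are placeholders, and attribute names should be checked against the savedsearches.conf spec for your version:

```
# $SPLUNK_HOME/etc/system/local/savedsearches.conf  (sketch; names are placeholders)
[hourly_status_summary]
search = index=web sourcetype=access_combined | sistats count by host, status
cron_schedule = 0 * * * *
enableSched = 1
action.summary_index = 1
action.summary_index._name = summary
```

The dashboard panels would then run something cheap like `index=summary source=hourly_status_summary | stats count by host, status` against the precomputed results.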
More CPU may be wanted regardless, possibly in combination with the other approaches.
We do offer professional services to take on these types of things as a whole. They sell in blocks of time and are more focused on ensuring customer success, as opposed to some less satisfying incarnations of that term.
We do have a support contract, so I will take the question to Splunk support. Unfortunately the guy who handles our support is on holiday, so, impatient, I asked here to try to get to the bottom of the issue, hoping that someone may have seen something similar before and might have an easy solution.