I am seeing a performance bottleneck on our Splunk server. When generating charts from searches, it takes a long time to render the results. This is especially noticeable on dashboards where multiple graphs may be rendered. As our server is limited in resources, we have restricted concurrent searches to 5 (we are patient enough to wait for the results), yet the searches themselves complete relatively quickly: the jobs report indicates that 98% of them take between 0 and 3 seconds.
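For reference, we applied the concurrency cap in limits.conf; a minimal sketch of what we set, assuming the usual formula of max_searches_per_cpu * number_of_cpus + base_max_searches (the exact values here are illustrative):

[search]
# effective historical-search limit is roughly
# max_searches_per_cpu * number_of_cpus + base_max_searches;
# on our 4 logical CPUs this works out to 5 concurrent searches
base_max_searches = 1
max_searches_per_cpu = 1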
When loading a dashboard, multiple splunkd processes are spawned, which I assume are carrying out the concurrent searches. These processes disappear at times that seem to correspond to the completion times shown in the job report. Around the same time, the CPU usage of the main splunkd process rises as high as 300% and generates a load average of up to 16. Both remain high until all the panels within the dashboard have been rendered, and this rendering can take a long time. I would be prepared to accept this delay as reasonable when a search is long-running and retrieves a lot of data, but the lag seems excessive for searches that only take between 0 and 3 seconds.
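To see whether the spike is confined to a single thread of the main splunkd process, I have been watching per-thread CPU with something like the following (assumes a version of top that supports -H; pgrep -o picks the oldest, i.e. main, splunkd):

# show per-thread CPU usage for the main splunkd process
top -H -p $(pgrep -o splunkd)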
So, it would appear that we are hitting a bottleneck at the point where the panels, or the data used to generate the panels, are rendered; perhaps requests are queuing or backing up behind a process that is single-threaded. Has anyone else observed this behaviour, and is there any way to improve this performance other than increasing the server's CPU?
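In case it is relevant, this is a sketch of the search I can run to check whether splunkd's internal queues back up while a dashboard loads (assuming metrics.log is indexed into _internal as usual):

index=_internal source=*metrics.log group=queue
| timechart max(current_size) by name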
Server specifications are:
CPU: 2 x 3 GHz (hyperthreaded processors)
Cache size: 512 KB
Memory: 5 GB RAM
Storage: 1 TB SAN
Example of the difference between normal and higher load:
Normal load
===========
top - 20:02:26 up 90 days, 10:25, 12 users, load average: 0.50, 1.04, 0.94
Tasks: 208 total, 1 running, 207 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.5% us, 2.2% sy, 0.0% ni, 90.1% id, 0.2% wa, 0.0% hi, 0.0% si
Mem: 4933824k total, 4896456k used, 37368k free, 27384k buffers
Swap: 4192956k total, 10372k used, 4182584k free, 3867576k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17878 syslogng 15 0 8776 7360 940 S 27 0.1 3508:47 syslog-ng
31911 root 16 0 698m 609m 9.9m S 10 12.7 260:36.48 splunkd
High load
=========
top - 19:55:07 up 90 days, 10:17, 12 users, load average: 4.36, 1.92, 0.97
Tasks: 217 total, 1 running, 216 sleeping, 0 stopped, 0 zombie
Cpu(s): 66.7% us, 7.2% sy, 0.0% ni, 13.4% id, 12.5% wa, 0.1% hi, 0.0% si
Mem: 4933824k total, 4900944k used, 32880k free, 23892k buffers
Swap: 4192956k total, 10372k used, 4182584k free, 3842728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24262 root 19 0 40900 15m 8196 S 127 0.3 0:21.68 splunkd
24131 root 18 0 56096 25m 8176 S 73 0.5 0:13.31 splunkd
31911 root 15 0 698m 609m 9.9m S 58 12.7 259:23.29 splunkd
17878 syslogng 15 0 8776 7360 940 S 27 0.1 3507:15 syslog-ng