I was tracking down a problem with a group of Forwarders reporting that the parsingqueue was blocked. To resolve that, I had to restart the Indexers. Looking at the metrics log on one of the Indexers, I see several messages about various queues being blocked: the aggqueue, the indexqueue, and the typingqueue. The splunktcpin queue blocked once in this group. These messages started showing up after the restart. I checked my file system and there is plenty of space (~64% free), so that is not the cause. I have not added any new Forwarders that could be sending in much more data than usual for processing. What else might I consider as possible causes for this situation?
Output from "grep blocked=true metrics.log*" follows:
metrics.log:04-16-2012 18:57:06.181 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=3083, largest_size=3250, smallest_size=538
metrics.log:04-16-2012 18:57:06.181 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=928, largest_size=1391, smallest_size=0
metrics.log:04-16-2012 18:57:06.182 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=647, largest_size=1386, smallest_size=0
metrics.log:04-16-2012 18:57:38.026 -0500 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=11, largest_size=153, smallest_size=0
metrics.log:04-16-2012 19:04:51.580 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1077, largest_size=1413, smallest_size=0
metrics.log:04-16-2012 19:05:23.285 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2927, largest_size=3408, smallest_size=773
metrics.log:04-16-2012 19:05:23.286 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=926, largest_size=1511, smallest_size=0
metrics.log:04-16-2012 19:10:06.908 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=3082, largest_size=3228, smallest_size=928
metrics.log:04-16-2012 19:10:06.909 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1107, largest_size=1410, smallest_size=0
metrics.log:04-16-2012 19:10:06.909 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1318, largest_size=1393, smallest_size=0
metrics.log:04-16-2012 19:10:38.579 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=3005, largest_size=3217, smallest_size=812
metrics.log:04-16-2012 19:10:38.579 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1491, largest_size=1491, smallest_size=0
metrics.log:04-16-2012 19:10:38.579 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=917, largest_size=1318, smallest_size=0
metrics.log:04-16-2012 19:12:12.240 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1137, largest_size=1464, smallest_size=0
metrics.log:04-16-2012 19:12:12.240 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1489, largest_size=1489, smallest_size=0
metrics.log:04-16-2012 19:14:17.668 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=741, largest_size=1212, smallest_size=0
metrics.log:04-16-2012 19:14:17.669 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=465, largest_size=1207, smallest_size=0
metrics.log:04-16-2012 19:17:57.425 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2570, largest_size=3562, smallest_size=734
metrics.log:04-16-2012 19:17:57.425 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1502, largest_size=1502, smallest_size=0
metrics.log:04-16-2012 19:17:57.426 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=939, largest_size=1308, smallest_size=0
metrics.log:04-16-2012 19:22:39.198 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2991, largest_size=3361, smallest_size=810
metrics.log:04-16-2012 19:22:39.199 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=767, largest_size=1472, smallest_size=0
Basically, there is a series of queues that data passes through when it arrives in Splunk. First it goes through the parsingQueue and then on to the aggQueue. After that it hits the typingQueue, until it finally passes into the indexQueue and on to disk.
If the indexQueue blocks, it won't take long for this to cascade back through the other queues as they also begin to block. Common causes of this are slow disk read/write performance and/or a high volume of logs arriving from forwarders or syslog. Splunk writes data to disk as fast as it can, but if it is receiving data faster than it can write it, the queues will begin to block.
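If you want to confirm which queue is blocking most often (the furthest-downstream queue that blocks is usually the real bottleneck), a quick-and-dirty tally from the same metrics.log you already grepped would look something like this -- a sketch, adjust the path to your environment:

grep "blocked=true" metrics.log* | sed 's/.*name=\([^,]*\),.*/\1/' | sort | uniq -c | sort -rn

If indexqueue dominates the counts, disk I/O is the first place to look; if splunktcpin or parsingqueue dominates, the pressure is further upstream.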
Some quick things to check would be whether your forwarders are capped at the default output rate (I think about 256 KBps) or whether they have been modified to output at a faster rate. Also have a look at the IOPS of the hard disk(s) to ensure they meet or exceed the recommended minimum of roughly 800 (you want more if the system is under heavy use). Finally, you might just need to expand your setup to include another indexer.
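For the forwarder cap, the setting lives in limits.conf on the forwarder itself. The stanza looks roughly like this -- treat the exact default as approximate, since it can vary by version:

[thruput]
# default output cap on a Universal Forwarder is around 256 KBps;
# raise it (or set 0 for unlimited) only if the indexers can keep up
maxKBps = 256

Bear in mind that removing the cap on a busy forwarder can simply move the bottleneck onto the indexer.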
I have downloaded that but have not yet taken the time to check it out -- 'real' work gets in the way. Perhaps I need to reset some priorities so that I can make use of the product to support the product!
In light of the need for "data to justify", I'd check out the Splunk on Splunk (S.o.S.) app. It has a report called indexing_performance which shows which portions of your Splunk pipeline are busy and reflects that status over time, so you can see whether it's a spike or a trend. Be careful, though: Splunk on Splunk comes with a couple of scripted inputs that collect performance data; if you're already close to your limit, you'll want to disable those specific inputs via Splunk Manager.
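If you want a rough version of that picture before installing the app, a search over the _internal index along these lines should work (field names taken from the metrics.log entries above, so this is a sketch rather than the S.o.S. report itself):

index=_internal source=*metrics.log* group=queue blocked=true | timechart span=5m count by name

That charts how often each queue reports blocked over time, which is usually enough to tell a one-off spike from a sustained trend.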
I have two indexers and am collecting data to justify a third -- budget considerations, etc., make justification necessary and appropriate. I have a fair number of forwarders, too, and some can burst a bunch of data now and then. Although I have not dug into the overall picture, I suspect that these 'bursting servers' did so en masse.
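To test that theory I figure I can chart per-forwarder volume from the tcpin_connections entries in metrics.log -- something along these lines, assuming the sourceHost and kb fields I see in my own metrics data:

index=_internal source=*metrics.log* group=tcpin_connections | timechart span=5m sum(kb) by sourceHost

If the blocked-queue messages line up with spikes from a handful of hosts, that would point at the bursting servers.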