I am seeing a lot of blocking on my three indexers, in the range of 500-1000 blocked-queue events a day per host. The heaviest are indexqueue and typingqueue, followed by aggqueue; splunktcpin is in the double-digit range.
The indexes are striped across all three indexers. I'm at a loss as to where to begin looking. Has anyone else seen this kind of blocking on their Splunk indexers?
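For context, this is roughly how I'm counting the blocked events, just a sketch against the internal metrics.log (adjust the time range and add a host filter as needed):

index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name
| sort - count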
I believe I've identified the root cause as slow disk for the cold DB. Our configuration keeps the hot/warm DBs on locally attached (virtually, anyway) disks and points the cold DBs to a CIFS share on a NetApp, so indexes.conf looks something like this...
[databases]
coldPath = \\netapp\splunk\SplunkIndex02\DATA_2\databases\colddb
homePath = F:\CustomIndex\DATA_2\databases\db
thawedPath = \\netapp\splunk\SplunkIndex02\DATA_2\databases\thaweddb
maxWarmDBCount = 32
So, I created another locally attached drive and used it as the coldPath on ONE of the three indexers we have. After 4 hours, we have not seen ANY blocking on the indexer with the locally attached drive, while the other two continue to see blocking at the same rate as before. In this particular case, the slow disk was the cold DB. If there were a way to have Splunk roll buckets to cold on a schedule, rather than constantly, this would not be a problem.
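For reference, the change on the test indexer was just repointing coldPath at the new local volume (the drive letter below is illustrative, not the real mount):

[databases]
homePath = F:\CustomIndex\DATA_2\databases\db
coldPath = G:\CustomIndex\DATA_2\databases\colddb
thawedPath = \\netapp\splunk\SplunkIndex02\DATA_2\databases\thaweddb
maxWarmDBCount = 32

As far as I know there is no scheduled warm-to-cold roll; buckets roll when maxWarmDBCount is exceeded (or when homePath hits its size cap), so the closest workaround is raising maxWarmDBCount so the fast hot/warm storage holds buckets longer and rolls happen less often.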
Yeah, I've had this happen. How many GB is each indexer handling daily? A safe ceiling is around 100GB/day per indexer.
Approx 30GB/day... However, this happens even at substantially lower indexing volumes.
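In case it helps for comparison, I'm measuring daily volume per indexer roughly like this (a sketch against the per_index_thruput metrics in _internal; field names may differ slightly by version):

index=_internal source=*metrics.log* group=per_index_thruput
| eval GB = kb/1024/1024
| timechart span=1d sum(GB) AS GB_indexed by host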
Have you tried increasing the queue maxSize in splunk/etc/system/local/?
server.conf:
##########################################################################################
# Queue settings
##########################################################################################
[queue]
maxSize = [<integer>|<integer>[KB|MB|GB]]
* Specifies default capacity of a queue.
* If specified as a lone integer (for example, maxSize=1000), maxSize indicates the maximum number of events allowed
in the queue.
* If specified as an integer followed by KB, MB, or GB (for example, maxSize=100MB), it indicates the maximum
RAM allocated for queue.
* The default is 500KB.
[queue=<queueName>]
maxSize = [<integer>|<integer>[KB|MB|GB]]
* Specifies the capacity of a queue. It overrides the default capacity specified in [queue].
* If specified as a lone integer (for example, maxSize=1000), maxSize indicates the maximum number of events allowed
in the queue.
* If specified as an integer followed by KB, MB, or GB (for example, maxSize=100MB), it indicates the maximum
RAM allocated for queue.
* The default is inherited from maxSize value specified in [queue]
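So, for example, to raise just the index queue's capacity you would put something like this in splunk/etc/system/local/server.conf and restart splunkd (the stanza name and value below are illustrative; double-check the queue names in the spec for your version):

[queue=indexQueue]
maxSize = 2000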
More info... I looked at a low indexing volume window (800MB/indexer) and we still saw 28 indexqueue blocking events...
The interesting part is that if you look at disk queueing, disk response times, and IOPS, there is not much to indicate a disk bottleneck... Queue depth is less than 1, response time is sub-20ms, and IOPS are under 100... We tested the disks before installing Splunk and were able to reach upwards of 3,000 IOPS... Of note, these machines are virtualized, but they are not sharing resources with other servers; essentially dedicated from a server AND SAN perspective...
Seeing that many messages a day, I would be concerned that a larger queue size would just delay the issue, since it seems the data isn't getting written out to disk quickly enough.
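One way to sanity-check that is to watch how full the queue actually runs rather than just counting blocked events, something along these lines (a sketch; current_size_kb and max_size_kb are the fields I see in my metrics.log queue lines):

index=_internal source=*metrics.log* group=queue name=indexqueue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=15m avg(fill_pct) by host

If it sits pinned near 100% the whole time, a bigger queue only buys a short delay before the downstream bottleneck backs it up again.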
Thanks! I bumped indexqueue to 2000 and will look into increasing any others.