We have an older Red Hat 5.6 box running the Splunk Universal Forwarder 5.0.2 processing a few directories with many *.gz files. The system seems to be keeping up well enough, but we've noticed a lot of "blocked=true" entries showing up in metrics.log, mainly for the "aeq" queue. Here's a sample:
[root@linux1621 splunk]# grep "name=aeq" metrics.log | tail
08-08-2014 20:01:03.681 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=63, smallest_size=0
08-08-2014 20:01:34.683 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=0
08-08-2014 20:02:05.562 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=5
08-08-2014 20:02:36.564 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=22
08-08-2014 20:03:07.565 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=482, current_size=15, largest_size=61, smallest_size=0
08-08-2014 20:03:38.564 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=482, current_size=15, largest_size=15, smallest_size=15
08-08-2014 20:04:09.402 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=61, smallest_size=0
08-08-2014 20:04:40.403 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=0
08-08-2014 20:05:11.403 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=1
08-08-2014 20:05:42.404 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=7
These seem to correlate with messages like the following in splunkd.log:
08-08-2014 20:05:42.836 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:43.044 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:43.708 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:44.057 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:44.394 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:45.363 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:46.339 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:47.939 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:48.251 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:48.459 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
I tried increasing maxKBps in limits.conf (doubled it from 1024 to 2048), but the blocking messages returned right after restart.
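For reference, this is the kind of stanza I changed (I'm assuming the usual $SPLUNK_HOME/etc/system/local/limits.conf location on the forwarder):
[thruput]
# forwarder thruput cap in KB/s, doubled from our previous 1024
maxKBps = 2048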
CPU and RAM on this system look fine: load average is below 1.00 most of the time, and memory is mostly buffers/cache with no swapping.
What is "aeq" and where are it's parameters adjusted? Can we increase the max_size_kb (presumably to 1024)?
Or is this a red herring and we need to look elsewhere?
In your server.conf, add the following (this setting adjusts the default size for all queues):
[queue]
# maxSize accepts a KB, MB, or GB suffix
maxSize = 500MB
You can also target specific queues, but I was unable to get that to work; see the server.conf spec and the sketch below.
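For what it's worth, server.conf.spec describes a per-queue stanza along these lines; I haven't confirmed it takes effect for aeq on a 5.0.x universal forwarder, so treat it as a sketch:
[queue=aeq]
# per-queue override; the value takes a KB, MB, or GB suffix
maxSize = 1MB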
Thanks, it looks like the "aeq" queue is a single-threaded process that decompresses .GZ files, and that's all this system is doing. We're investigating sending uncompressed log files to see if that helps. I'll update when we know more.
aeq appears to feed the archiveProcessor: http://wiki.splunk.com/Community:HowIndexingWorks
That's just a symptom, though, of a bottleneck somewhere further down the pipeline. As you can see from the diagrams, aeq is right at the top. Look for the bottommost queue that's blocked and you have your culprit.
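A quick way to see which queues are reporting blocked=true and how often (assuming a default /opt/splunkforwarder install path):
grep 'group=queue' /opt/splunkforwarder/var/log/splunk/metrics.log \
  | grep 'blocked=true' \
  | sed 's/.*name=\([^,]*\),.*/\1/' \
  | sort | uniq -c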
Hitting the thruput limit in limits.conf produces its own dedicated message in splunkd.log, so you should see that explicitly if you're actually hitting it.
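A simple check (the exact wording of that message varies by version, so this just greps for the thruput processor's output):
grep -i 'thruput' /opt/splunkforwarder/var/log/splunk/splunkd.log | tail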