We have an older Red Hat 5.6 box running Splunk Universal Forwarder 5.0.2, monitoring a few directories that contain many *.gz files. The system seems to be keeping up well enough, but metrics.log has started logging a lot of blocked=true entries, mostly from the "aeq" queue. Here's a sample:
[root@linux1621 splunk]# grep "name=aeq" metrics.log | tail
08-08-2014 20:01:03.681 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=63, smallest_size=0
08-08-2014 20:01:34.683 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=0
08-08-2014 20:02:05.562 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=5
08-08-2014 20:02:36.564 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=22
08-08-2014 20:03:07.565 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=482, current_size=15, largest_size=61, smallest_size=0
08-08-2014 20:03:38.564 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=482, current_size=15, largest_size=15, smallest_size=15
08-08-2014 20:04:09.402 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=61, smallest_size=0
08-08-2014 20:04:40.403 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=0
08-08-2014 20:05:11.403 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=1
08-08-2014 20:05:42.404 +0000 INFO Metrics - group=queue, name=aeq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=61, largest_size=61, smallest_size=7
These seem to correlate with messages in splunkd.log showing the BatchReader unable to hand data off to the parsingQueue:
08-08-2014 20:05:42.836 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:43.044 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:43.708 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:44.057 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:44.394 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:45.363 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:46.339 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:47.939 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
08-08-2014 20:05:48.251 +0000 INFO BatchReader - Continuing...
08-08-2014 20:05:48.459 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
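I haven't yet confirmed whether parsingqueue (or the tcpout queue) also reports blocked=true in metrics.log at the same timestamps; I was planning to check with something like the following, though the queue names in the grep are my guess:

[root@linux1621 splunk]# grep "group=queue" metrics.log | grep -iE "name=(parsingqueue|tcpout)" | grep "blocked=true" | tail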
I tried increasing maxKBps in the [thruput] stanza of limits.conf (doubled it from 1024 to 2048), but the same messages returned right after a restart.
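For reference, the change amounts to this on the forwarder (the maxKBps setting lives in the [thruput] stanza):

[thruput]
# doubled from 1024; requires a splunkd restart to take effect
maxKBps = 2048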
CPU and RAM on this system look fine: load average stays below 1.00 most of the time, and memory is mostly buffers/cache with no swapping.
What is "aeq" and where are it's parameters adjusted? Can we increase the max_size_kb (presumably to 1024)?
Or is this a red herring and we need to look elsewhere?