I noticed on my Splunk instance that I am getting messages like these:
02-07-2020 15:20:36.038 -0500 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=993, largest_size=993, smallest_size=993
02-07-2020 15:21:35.038 -0500 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2035, largest_size=2035, smallest_size=2035
02-07-2020 15:21:35.038 -0500 INFO Metrics - group=queue, name=auditqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=809, largest_size=809, smallest_size=809
02-07-2020 15:21:35.038 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=998, largest_size=998, smallest_size=998
02-07-2020 15:21:35.038 -0500 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143, current_size=99, largest_size=99, smallest_size=99
02-07-2020 15:21:35.038 -0500 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=995, largest_size=995, smallest_size=995
How can I resolve this?
Based on your screenshot, you have multiple compounding issues.
You need to disable Transparent Huge Pages:
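As a quick sketch (the exact sysfs path can vary by distribution, and a permanent change usually belongs in your boot parameters or a tuned profile rather than a one-off echo):

    # Check current THP state ("[never]" means disabled)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag

    # Disable until the next reboot (run as root)
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

Restart splunkd after changing this so the process picks up the new setting.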
Your ulimits are not set correctly and need to be increased:
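For example, assuming splunkd runs as a user named splunk (the values below are common starting points, not official minimums — check the Splunk docs for your version, and note that on systemd hosts the unit file's LimitNOFILE/LimitNPROC can override limits.conf):

    # Check the limits actually in effect for the splunk user
    su - splunk -c 'ulimit -a'

    # Example entries in /etc/security/limits.conf
    splunk soft nofile 64000
    splunk hard nofile 64000
    splunk soft nproc  16000
    splunk hard nproc  16000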
Your system resources are below the recommended reference hardware, which usually means you're running on an undersized VMware guest.
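To confirm what the guest actually has, a couple of quick checks are worth running and comparing against Splunk's reference hardware guidance for an indexer (roughly 12 CPU cores and 12 GB of RAM at the time of writing):

    # CPU cores visible to the OS
    nproc

    # Total memory in GB
    free -g

    # Rough disk I/O sanity check (iostat is in the sysstat package)
    iostat -x 5 3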
If correcting the first two issues does not ease the congestion, you may want to consider increasing the parallel ingestion pipelines.
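If you do go that route, it is a one-line change in server.conf on the indexer; only do this if you have spare CPU cores, since each additional pipeline set consumes roughly another core's worth of processing:

    # $SPLUNK_HOME/etc/system/local/server.conf
    [general]
    parallelIngestionPipelines = 2

    # Restart splunkd for the change to take effect
    $SPLUNK_HOME/bin/splunk restart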
I noticed under netstat -tulpn that 9997 is not listening, even though it is defined under Settings -> Receive data. I tried disabling the receiver (which failed), then got a similar error when re-enabling it:
Error occurred attempting to enable 9997: .
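When the UI toggle fails, it can be worth checking and enabling the listener from the CLI instead; this is a sketch assuming splunkd is running and $SPLUNK_HOME points at your install:

    # Confirm nothing is listening on 9997
    netstat -tulpn | grep 9997

    # Enable the receiving port from the CLI
    $SPLUNK_HOME/bin/splunk enable listen 9997

    # If it still fails, splunkd.log usually shows the underlying error
    grep -i tcpinput $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -20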
Queue messages

Queue messages look like:

... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

Most of these values are not interesting, but current_size, especially considered in aggregate across events, can tell you which portions of the Splunk indexing pipeline are the bottlenecks. If current_size remains near zero, the indexing system is probably not being taxed in any way. If the queues remain near 1000, more data is being fed into the system (at that time) than it can process in total.

Sometimes you will see messages such as:

... group=queue, name=parsingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

This message contains the blocked string, indicating that the queue was full, something tried to add more data, and couldn't. A queue becomes unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason; a few scattered blocked messages indicate that flow control is operating, which is normal for a busy indexer.

If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point. There are queues for data going into the parsing pipeline and for data between parsing and indexing. Each networking output also has its own queue, which can be useful for determining whether data is being sent promptly, or whether there is some network or receiving-system limitation.
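If you want to chart this on your own instance, a starting-point SPL search over the internal index might look like the following (adjust the host filter and time range as needed):

    index=_internal source=*metrics.log* group=queue
    | timechart avg(current_size) by name

    index=_internal source=*metrics.log* group=queue blocked=true
    | timechart count by name

The first search shows how full each queue is over time; the second shows which queues are actually blocking and when.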
These messages appear because the queue's current size has reached its configured maximum (500 KB in this case), so the queue is full.