Hi All,
I have 4 heavy forwarder servers sending data to 5 indexers:
server1 acts as a syslog server, with autoLBFrequency = 10 and maxQueueSize = 1000MB
server2 acts as a syslog server and heavy forwarder, with autoLBFrequency = 10 and maxQueueSize = 500MB
server3 acts as a heavy forwarder, with autoLBFrequency = 10 and maxQueueSize = 500MB
server4 acts as a heavy forwarder, with autoLBFrequency = 10 and maxQueueSize = 500MB
We are receiving blocked=true in metrics.log while the syslog/heavy forwarder servers try to send data to the indexers. Because of this, ingestion is delayed and data arrives in Splunk 2-3 hours late.
One of the 5 indexers is consistently at 99-100% CPU utilization; it has 24 CPUs, and the other indexers also run with 24 CPUs.
We are planning to upgrade only the highly utilized indexer from 24 to 32 CPUs.
Kindly suggest whether updating the settings below in outputs.conf will reduce/stop the "blocked=true" messages in metrics.log and bring the indexer CPU load back to normal before we upgrade the CPU,
or whether we need to do both, i.e. change outputs.conf and upgrade the CPU. If both are needed, which should we try first? Kindly help.
autoLBFrequency = 5
maxQueueSize = 1000MB
aggQueueSize = 7000
outputQueueSize = 7000
As per the monitoring console, the indexing queue and splunktcpin queue are running high.
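For clarity, the settings we are proposing to change live in outputs.conf on each heavy forwarder. A minimal sketch of the relevant stanzas (the group name and indexer addresses are placeholders, not our real values; only autoLBFrequency and maxQueueSize are shown here):

# outputs.conf on each heavy forwarder (placeholder group name and hosts)
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = idx1:9997, idx2:9997, idx3:9997, idx4:9997, idx5:9997
# switch to the next indexer every 5 seconds instead of every 10
autoLBFrequency = 5
# size of the forwarder's in-memory output queue
maxQueueSize = 1000MB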
Further to my last reply - there are also a couple of worthwhile resources here which give an overview of how to identify and deal with blocked queues.
https://docs.splunk.com/Documentation/Splunk/8.2.4/Deploy/Datapipeline
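If it helps, a quick way to see how full each queue is over time on a given server is a search along these lines (a sketch; it uses the current_size_kb and max_size_kb fields from the group=queue metrics events, and <your_indexer> is a placeholder):

index=_internal source=*metrics.log group=queue host=<your_indexer>
| eval pct_full = round(current_size_kb / max_size_kb * 100, 1)
| timechart avg(pct_full) by name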
Thanks @livehybrid for your inputs. I checked blocked=true on all 4 heavy forwarders and could see it on the one that acts as the syslog server collecting the network-related data; there the typing queue is the one blocking, which the PDF describes as the bottleneck.
As per the PDF, I grepped metrics.log to see which sourcetype and host are consuming the most regex CPU:
04-22-2025 05:19:58.017 +0700 INFO Metrics - group=per_sourcetype_regex_cpu, series="cp_log", cpu=604, cpupe=0.0005149352537121802, bytes=1072305900, ev=1172963
04-22-2025 05:19:58.011 +0700 INFO Metrics - group=per_host_regex_cpu, series="networkserver", cpu=596, cpupe=0.0005081981051714273, bytes=1072185809, ev=1172771
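Rather than grepping, the same ranking can be pulled from the _internal index with something like the search below (a sketch; it assumes the heavy forwarder forwards its own _internal data and uses the cpu, ev, and series fields visible in the events above):

index=_internal source=*metrics.log group=per_sourcetype_regex_cpu
| stats sum(cpu) as total_cpu sum(ev) as events by series
| sort - total_cpu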
Kindly let me know what to do next.
Hi
Increasing autoLBFrequency, maxQueueSize, aggQueueSize, or outputQueueSize in outputs.conf on your heavy forwarders may help temporarily reduce "blocked=true" messages, but these settings do not address the root cause: your indexer(s) are overloaded and unable to keep up with incoming data.
The following will tell you which queues are blocking on which servers:
index=_internal source=*metrics.log blocked=true | stats count by host, group, name
Do not rely solely on queue size increases; this can delay but not prevent data loss if indexers remain overloaded.
Investigate why one indexer is overloaded (check for hot buckets, network issues, or misconfigured load balancing). Understanding *why* that single indexer is blocking is probably the important thing here - it could be a number of things, but it is most likely either a resource issue (e.g. a faulty disk) or one of your syslog feeds failing to balance across to another indexer.
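One way to check whether the forwarders are actually spreading load evenly is to compare how much data each indexer receives on its splunktcp input, for example (a sketch using the kb and sourceHost fields from the group=tcpin_connections metrics events):

index=_internal source=*metrics.log group=tcpin_connections
| stats sum(kb) as received_kb by host, sourceHost
| sort - received_kb

If one indexer (host) receives far more than its share, or one forwarder (sourceHost) only ever appears against that indexer, that points at the load balancing rather than the indexer hardware.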
Is it always the same indexer that runs hot? Or does it change?