I'm seeing delays in events arriving in the indexes. Events are collected by SplunkForwarder agents. When events stop arriving completely, restarting the agents helps, but when events are only delayed, restarting the agents does not help.
Events go to the HFs, then to the indexers.
On the Splunk universal forwarders I see errors like these in splunkd.log:
WARN TailReader [282099 tailreader0] - Could not send data to output queue (parsingQueue), retrying...
+0300 INFO HealthChangeReporter - feature="Large and Archive File Reader-0" indicator="data_out_rate" previous_color=green color=yellow due_to_threshold_value=1 measured_value=1 reason="The monitor input cannot produce data because splunkd's processing queues are full. This will be caused by inadequate indexing or forwarding rate, or a sudden burst of incoming data."
Where is the problem: on the universal forwarder or on the heavy forwarder?
What should I look at?
That's a different parameter. This one just raises the queue size limit but the thruput limit still caps the speed of outgoing events.
Verify the output of
splunk btool limits list thruput | grep -i maxkbps
The default value of 256 (KB/s, set as maxKBps under [thruput] in limits.conf) is relatively low and can become a bottleneck.
One way to find the queues that block most often:
index=_internal splunk_server= <your indexer nodes> sourcetype=splunkd TERM(group=queue) (TERM(name=parsingQueue) OR TERM(name=indexqueue) OR TERM(name=tcpin_queue) OR TERM(name=aggqueue)) | eval is_blocked=if(blocked=="true",1,0), host_queue=host." - ".name | stats sparkline sum(is_blocked) as blocked,count by host_queue | eval blocked_ratio=round(blocked/count*100,2) | where blocked_ratio > 0 | sort 50 -blocked_ratio | eval requires_attention=case(blocked_ratio>50.0,"fix highly recommended!",blocked_ratio>40.0,"you better check..",blocked_ratio>20.0,"usually no need to worry but keep an eye on it",1=1,"not unusual")
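If you want a quick look at a single instance without searching _internal, roughly the same blocked ratio can be computed straight from its local metrics.log. This is only a sketch in Python: it assumes the usual "group=queue, name=..., blocked=true, ..." line layout (with blocked=true present only on blocked samples), and the default log path below is an assumption, so pass your own.

```python
import re
import sys
from collections import defaultdict

# Assumed metrics.log layout: "... group=queue, name=parsingqueue, blocked=true, ..."
QUEUE_LINE = re.compile(r"group=queue, name=(?P<name>[^,\s]+)")


def blocked_ratios(lines):
    """Return {queue name: percent of queue samples that contain blocked=true}."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for line in lines:
        match = QUEUE_LINE.search(line)
        if not match:
            continue  # not a queue metrics line
        name = match.group("name").lower()
        totals[name] += 1
        if "blocked=true" in line:
            blocked[name] += 1
    return {name: round(blocked[name] / totals[name] * 100, 2) for name in totals}


if __name__ == "__main__":
    # Hypothetical default path; adjust to your installation.
    path = sys.argv[1] if len(sys.argv) > 1 else "/opt/splunkforwarder/var/log/splunk/metrics.log"
    with open(path, errors="replace") as handle:
        ratios = blocked_ratios(handle)
    for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
        print(f"{name}: {ratio}% blocked")
```

The queue with the highest ratio is the first one to look at, the same way as with the search above.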
Then check whether the UFs have hit their throughput limit:
index=_internal sourcetype=splunkd component=ThruputProcessor "current data throughput" | rex "Current data throughput \((?<kb>\S+)" | eval rate=case(kb < 500, "256", kb >= 500 AND kb < 520, "512", kb >= 520 AND kb < 770, "768", kb >= 770 AND kb < 1210, "1024", 1=1, ">1024") | stats count as Count sparkline as Trend by host, rate | where Count > 4 | rename host as "Host" rate as "Throughput rate(kb)" Count as "Hit Count" | sort -"Throughput rate(kb)",-"Hit Count"
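The same check can be done locally on one UF by scanning its splunkd.log for the throttle message. A small Python sketch, assuming the message contains "Current data throughput (<number> ..." as the rex above implies; the sample line in the comment is illustrative, not copied from a real log.

```python
import re

# Pattern taken from the rex in the search above; the rest of the message is not assumed.
THROTTLE = re.compile(r"Current data throughput \((?P<kb>[\d.]+)", re.IGNORECASE)


def throttle_hits(lines):
    """Return the throughput values (as floats) from every throttle message found."""
    hits = []
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            hits.append(float(match.group("kb")))
    return hits


if __name__ == "__main__":
    # Illustrative line only, e.g.:
    # "WARN ThruputProcessor - Current data throughput (262 kb/s) has reached maxKBps."
    import sys
    with open(sys.argv[1], errors="replace") as handle:
        values = throttle_hits(handle)
    print(f"{len(values)} throttle messages, peak {max(values, default=0)}")
```

Many hits at a value close to the configured maxKBps suggest the forwarder is being capped by its own limit rather than by anything downstream.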
But as I said earlier, it's much easier to check these in the MC (Monitoring Console).
The log message gives a few possible reasons. It's possible the indexers or the HF are not processing data fast enough. Verify the servers meet or exceed Splunk's hardware requirements and that the indexers are writing to fast storage.
Consider having the UF send directly to the indexers to see if the problem is with the HF.
Verify there are no network issues periodically preventing the instances from connecting to each other.
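For the network part, a rough way to spot intermittent connection problems is to run a simple TCP probe from the UF host toward the HF (and from the HF toward the indexers) for a while. A minimal Python sketch; the host names are hypothetical and port 9997 is only the conventional receiving port, so use whatever your outputs.conf points at.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical targets; replace with your HF / indexer receiving addresses.
    targets = [("hf01.example.com", 9997), ("idx01.example.com", 9997)]
    while True:
        for host, port in targets:
            if not can_connect(host, port):
                print(f"{time.strftime('%H:%M:%S')} cannot reach {host}:{port}")
        time.sleep(30)
```

This only shows that the port accepts connections. It says nothing about throughput, so a clean run does not rule out a slow link.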
Yes, it's possible. For that reason, the basic procedure for getting that information is to start from the indexers and check the state of their pipelines and queues. You can easily see that in the MC or with the queries presented in that conf presentation.
Quite often the cause is on the IDX side, when the indexers can't write to disk fast enough. Of course, if you have some massive transforms etc., those could be the cause instead.
You can find the real cause by starting from the indexer side and going through the pipelines queue by queue; usually the first queue whose usage is around 90-100% is the guilty one. The easiest way to do this is with the MC. I suppose you have configured a central MC for your distributed environment? If not, now is the time to do it. You should also add your HFs to it as indexers, and create separate groups for the real indexers and the HFs, to make it easier to see what is happening on each layer.
Just select Settings -> MC -> Indexing -> Performance and you can see quite easily what the situation is on the different layers and queues.
Here is an old conf presentation on how to check this from the internal indexes / log files.