Hello, Team!
I am seeing delays in events arriving in the indexes. Events are collected by SplunkForwarder (Universal Forwarder) agents. When events stop arriving completely, restarting the agents helps, but when events are only delayed, restarting the agents does not help.
Events go to HFs (heavy forwarders), then to the indexers.
On the Splunk Universal Forwarders I see errors like this in splunkd.log:
WARN TailReader [282099 tailreader0] - Could not send data to output queue (parsingQueue), retrying...
and in metrics.log:
+0300 INFO HealthChangeReporter - feature="Large and Archive File Reader-0" indicator="data_out_rate" previous_color=green color=yellow due_to_threshold_value=1 measured_value=1 reason="The monitor input cannot produce data because splunkd's processing queues are full. This will be caused by inadequate indexing or forwarding rate, or a sudden burst of incoming data."
Where is the problem: on the Universal Forwarder or on the Heavy Forwarder?
What should I look at?
Are you sure you're not hitting the default (relatively low) thruput limit?
I tried setting this on the forwarders, but it didn't help:
[queue=parsingQueue]
maxSize = 64MB
That's a different parameter. This one just raises the queue size limit but the thruput limit still caps the speed of outgoing events.
Verify the output of
splunk btool limits list thruput | grep maxKBps
The default value of 256 KBps is relatively low and can easily become a bottleneck.
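If that cap is what you are hitting, the fix goes into limits.conf on the forwarder, not the queue stanza you tried above. A minimal sketch, assuming you manage the UF locally (the 1024 value is just an example, pick what suits your environment; pushing it via a deployment app is the cleaner option):
# $SPLUNK_HOME/etc/system/local/limits.conf on the UF
[thruput]
# UF default is 256 KBps; 0 disables the limit entirely -- raise with care
maxKBps = 1024
Restart the forwarder afterwards and re-run the btool check to confirm the new value is actually picked up.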
One way to look at the top blocking queues:
index=_internal splunk_server= <your indexer nodes> sourcetype=splunkd TERM(group=queue) (TERM(name=parsingQueue) OR TERM(name=indexqueue) OR TERM(name=tcpin_queue) OR TERM(name=aggqueue))
| eval is_blocked=if(blocked=="true",1,0), host_queue=host." - ".name
| stats sparkline sum(is_blocked) as blocked, count by host_queue
| eval blocked_ratio=round(blocked/count*100,2)
| where blocked_ratio > 0
| sort 50 -blocked_ratio
| eval requires_attention=case(blocked_ratio>50.0,"fix highly recommended!",blocked_ratio>40.0,"you better check..",blocked_ratio>20.0,"usually no need to worry but keep an eye on it",1=1,"not unusual")
and check whether the UF's thruput limit has been reached:
index=_internal sourcetype=splunkd component=ThruputProcessor "current data throughput"
| rex "Current data throughput \((?<kb>\S+)"
| eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", kb > 520 AND kb < 770 ,"768", kb>771 AND kb<1210, "1024", 1=1, ">1024")
| stats count as Count sparkline as Trend by host, rate
| where Count > 4
| rename host as "Host", rate as "Throughput rate(kb)", Count as "Hit Count"
| sort -"Throughput rate(kb)", -"Hit Count"
But as I said earlier, it's much easier to look at these in the MC (Monitoring Console).
The log message gives a few possible reasons. It's possible the indexers or HF are not processing data fast enough. Verify the servers exceed Splunk's hardware requirements and that the indexers are writing to fast storage.
Consider having the UF send directly to the indexers to see if the problem is with the HF.
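For that test you could temporarily point one UF straight at the indexers. A minimal outputs.conf sketch (the host names and the 9997 port are placeholders for your environment):
# $SPLUNK_HOME/etc/system/local/outputs.conf on a test UF
[tcpout]
defaultGroup = direct_to_indexers

[tcpout:direct_to_indexers]
# replace with your real indexer host:port pairs
server = idx1.example.com:9997, idx2.example.com:9997
If the delay disappears when bypassing the HF, the bottleneck is on the HF layer; if it stays, look at the indexers.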
Verify there are no network issues periodically preventing the instances from connecting to each other.
Is it possible that the problem is on the indexers, even though the forwarders send data to the heavy forwarders for parsing and only then to the indexers for indexing?
Yes, it's possible. For that reason the basic procedure is to start from the indexers and check the situation with their pipelines and queues. You can easily see that in the MC or with the queries presented in that conf presentation.
Hi,
Quite often the reason is on the IDX side, when the indexers cannot write to disk fast enough. Of course, if you have some massive transforms etc., those could also be the reason.
You can figure out the real reason by starting from the indexer side and looking at usage pipeline by pipeline and queue by queue; usually the first queue whose fill level sits around 90-100% is the guilty one. The easiest way to do this is with the MC. I suppose you have configured a central MC for your distributed environment? If not, now is the time to do it. You should also add your HFs to it alongside the indexers and create separate groups for the real indexers and the HFs, to make it easier to see what is happening on which layer.
Just select Settings -> Monitoring Console -> Indexing -> Performance and you can quite easily see the situation on the different layers and queues.
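If you prefer to check the same thing straight from the internal logs, a rough sketch along these lines should work (it assumes the max_size_kb / current_size_kb fields that the metrics.log queue entries normally carry):
index=_internal sourcetype=splunkd TERM(group=queue) host=<your indexers and HFs>
| eval fill_pct=round(current_size_kb/max_size_kb*100,1)
| timechart span=5m perc90(fill_pct) by name
The first queue in the pipeline that sits near 100% is the one to investigate; the queues behind it are usually just backed up by it.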
Here is one old conf presentation that shows how to do this from the internal indexes / log files.
r. Ismo