Getting Data In

Event delays

bosseres
Contributor

Hello, Team!

I see delays in events arriving in the indexes. Events are collected by Splunk forwarder agents. When events stop arriving completely, restarting the agents helps, but when events only arrive with a delay, restarting the agents does not help.

Events go to the HFs, then to the indexers.

On the Splunk universal forwarders I see errors like this in splunkd.log:

WARN TailReader [282099 tailreader0] - Could not send data to output queue (parsingQueue), retrying...

and in metrics.log:

+0300 INFO HealthChangeReporter - feature="Large and Archive File Reader-0" indicator="data_out_rate" previous_color=green color=yellow due_to_threshold_value=1 measured_value=1 reason="The monitor input cannot produce data because splunkd's processing queues are full. This will be caused by inadequate indexing or forwarding rate, or a sudden burst of incoming data."

 

Where is the problem: on the universal forwarder or on the heavy forwarder?

What should I look at?


PickleRick
SplunkTrust

Are you sure you're not hitting the default (relatively low) thruput limit?


bosseres
Contributor

I tried setting this on the forwarders, but it didn't help:

[queue=parsingQueue]
maxSize = 64MB


PickleRick
SplunkTrust

That's a different parameter. This one just raises the queue size limit but the thruput limit still caps the speed of outgoing events.

Verify the output of

splunk btool limits list thruput | grep maxKBps

The default value of 256 KBps is relatively low and might cause a bottleneck.
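If that turns out to be the cap, a minimal limits.conf sketch for the UF (placed, for example, in $SPLUNK_HOME/etc/system/local/limits.conf) could look like this:

[thruput]
# 0 removes the throttle entirely; on a shared network you may prefer a finite value such as 1024
maxKBps = 0

The forwarder needs a restart for the change to take effect.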


isoutamo
SplunkTrust

One way to look at the top blocking queues:

index=_internal splunk_server=<your indexer nodes> sourcetype=splunkd TERM(group=queue) (TERM(name=parsingQueue) OR TERM(name=indexqueue) OR TERM(name=tcpin_queue) OR TERM(name=aggqueue))
| eval is_blocked=if(blocked=="true",1,0), host_queue=host." - ".name
| stats sparkline sum(is_blocked) as blocked, count by host_queue
| eval blocked_ratio=round(blocked/count*100,2)
| where blocked_ratio > 0
| sort 50 -blocked_ratio
| eval requires_attention=case(blocked_ratio>50.0,"fix highly recommended!",blocked_ratio>40.0,"you better check..",blocked_ratio>20.0,"usually no need to worry but keep an eye on it",1=1,"not unusual")

and check whether the UF's thruput limit has been reached:

index=_internal sourcetype=splunkd component=ThruputProcessor "current data throughput" 
| rex "Current data throughput \((?<kb>\S+)" 
| eval rate=case(kb < 500, "256", kb < 520, "512", kb < 770, "768", kb < 1210, "1024", 1=1, ">1024")
| stats count as Count sparkline as Trend by host, rate 
| where Count > 4 
| rename host as "Host" rate as "Throughput rate(kb)" Count as "Hit Count"
| sort -"Throughput rate(kb)",-"Hit Count"

But as I said earlier, it's much easier to check these from the MC (Monitoring Console).


richgalloway
SplunkTrust

The log message gives a few possible reasons.  It's possible the indexers or HF are not processing data fast enough.  Verify the servers exceed Splunk's hardware requirements and that the indexers are writing to fast storage.

Consider having the UF send directly to the indexers to see if the problem is with the HF.

Verify there are no network issues periodically preventing the instances from connecting to each other.
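For the direct-to-indexers test suggested above, a hedged outputs.conf sketch for the UF, assuming hypothetical indexer host names idx1/idx2 and the default receiving port 9997 (and that the existing tcpout group pointing at the HFs is disabled while testing):

[tcpout]
defaultGroup = direct_to_indexers

[tcpout:direct_to_indexers]
server = idx1.example.com:9997, idx2.example.com:9997

If the delays disappear with this routing, the bottleneck is on the HF layer; if they persist, look at the indexers.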

---
If this reply helps you, Karma would be appreciated.

bosseres
Contributor

Is it possible that the problem is on the indexers, even though the Splunk forwarders send data to the heavy forwarders for parsing and then on to the indexers for indexing?


isoutamo
SplunkTrust

Yes, it's possible. For that reason the basic procedure for getting that information is to start from the indexers and check the situation on their pipelines and queues. You can easily see that from the MC or with the queries presented in that conf presentation.
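As a rough sketch of such a check against the indexers' metrics.log (the group=queue events carry current_size_kb and max_size_kb; <your_indexer_hosts> is a placeholder for your real indexer host names):

index=_internal host IN (<your_indexer_hosts>) source=*metrics.log* sourcetype=splunkd group=queue
| eval fill_pct=round(current_size_kb/max_size_kb*100,1)
| timechart span=5m avg(fill_pct) by name

Typically the last queue in the pipeline that is consistently near 100% points at the stage that is too slow; for example, a full indexqueue points at disk I/O on the indexers.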


isoutamo
SplunkTrust

Hi,

Quite often the reason is on the IDX side, when the indexers can't write to disk fast enough. Of course, if you have some massive transforms etc., those could also be the reason.

You can figure out the real reason by starting from the indexer side and looking at usage pipeline by pipeline and queue by queue; usually the first queue whose usage is around 90-100% is the guilty one. The easiest way to do this is with the MC. I suppose you have configured a central MC for your distributed environment? If not, now is the time to do it. You should/could also add your HFs to it like the indexers, and add separate groups for the real indexers and the HFs to make it easier to see what is happening on which layer.

Just select Settings -> Monitoring Console -> Indexing -> Performance and you can quite easily see the situation on the different layers and queues.

Here is one old .conf presentation on how to investigate this from the internal indexes / log files.

r. Ismo

 
