Monitoring Splunk

Processing queues blocking when thruput/event volume decreases

cwhelan
Explorer

Hi guys,

I am currently seeing that processing queues on one of my heavy forwarders appear to be blocking during a 6-hour period at night, when the log volume being ingested is much lower (during this period, ingested volume drops from 10 million to under 3 million events).

Are there any obvious reasons why queue block ratios would increase at the same time that thruput/event volume decreases? I'm guessing the opposite would generally be expected.

We can see that block ratios increase at 1:00 AM below:

index=_internal source=*metrics.log group=queue blocked=true host=HF max_size_kb>0
| timechart span=30m@m count by name

[Screenshot: blocked queue counts by queue name in 30-minute buckets, increasing from 1:00 AM]
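
To narrow down which queue saturates first, I can also chart the fill percentage rather than just the blocked count. This is only a sketch against the same metrics.log data; current_size_kb and max_size_kb are the queue-size fields I'd expect to be there:

index=_internal source=*metrics.log host=HF group=queue max_size_kb>0
| eval fill_pct=round(100*current_size_kb/max_size_kb,1)
| timechart span=30m@m max(fill_pct) by name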

 

At the same time we can see thruput decrease below:

index=_internal sourcetype=splunkd host=HF group=per_sourcetype_thruput series=*
| timechart sum(kb) by series

[Screenshots: per-sourcetype thruput (sum of KB by series), showing the volume drop over the same period]
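
To check whether the drop is in the forwarder's overall output rather than a single noisy sourcetype, something like the following should also work (a sketch; per_host_thruput is another metrics.log group on the same host, so the data should already be in _internal):

index=_internal sourcetype=splunkd host=HF group=per_host_thruput
| timechart span=30m@m sum(kb) as total_kb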

 

Current queue config and context over last 24h:

[Screenshots: current queue configuration and queue context over the last 24 hours]

 

I have also noticed a lot of 'Could not send data to output queue (parsingQueue), retrying' messages around the same time:

index=_internal host=HF source=*splunkd* (log_level=ERROR OR log_level=WARN) event_message="Could not send data to output queue (parsingQueue), retrying..."
| timechart span=5m@m count by event_message

[Screenshot: count of 'Could not send data to output queue (parsingQueue), retrying...' messages in 5-minute buckets]
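
Since a blocked parsingQueue is usually just back-pressure from queues further down the pipeline, I also want to see which queue starts blocking first each night. This is a rough sketch against the same metrics.log data, not a definitive diagnostic:

index=_internal source=*metrics.log host=HF group=queue blocked=true earliest=-24h
| stats min(_time) as first_blocked count by name
| eval first_blocked=strftime(first_blocked, "%Y-%m-%d %H:%M:%S")
| sort first_blocked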


I would appreciate any insight into why queue block ratios would increase at the same time that thruput/event volume decreases, as well as any suggestions for getting the average queue block ratios as close to 0 as possible. Queues currently appear to be blocking throughout the day, with the highest block ratios occurring between 1:00 and 8:00 AM.
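
For reference, the average block ratio per queue that I'm referring to can be approximated like this (a sketch that treats every metrics.log queue sample without blocked=true as unblocked):

index=_internal source=*metrics.log host=HF group=queue max_size_kb>0
| eval is_blocked=if(coalesce(blocked,"false")=="true",1,0)
| timechart span=30m@m avg(is_blocked) by name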


isoutamo
SplunkTrust

Hi

I would start by looking at what is happening on the indexer side. Are there any heavy backups or other activities causing heavy IOPS at the OS level or on your storage system?
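
If you can search _internal from your Splunk Cloud indexers, something like this would show whether their queues block at the same time (just a sketch; the idx-* host pattern is an assumption and depends on your stack's naming):

index=_internal source=*metrics.log group=queue blocked=true host=idx-* ```idx-* is an assumption, adjust to your indexer host names```
| timechart span=30m@m count by host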

r. Ismo

cwhelan
Explorer

Hi isoutamo,

Thanks a lot for the reply. Currently I only have back-end access to our HFs and the front-end console, as everything else is in Splunk Cloud.

 

Would you suggest I open a Splunk support case? Or is there anything else I can do to try to remediate this beforehand?

 

Cheers,

C


isoutamo
SplunkTrust

Unfortunately that information isn't available in the CMC. You could still check it to see whether there is additional load at that time or any other suspicious events.

I'm not sure whether you can access the _introspection index, or whether all the needed data is there, but you could try something like this to see IOPS and wait times for your Splunk Cloud instance:

 

index=_introspection sourcetype=splunk_resource_usage component=IOStats host="idx-*.<your stack name here>.splunkcloud.com"
| eval mount_point = 'data.mount_point' 
| eval reads_ps = 'data.reads_ps' 
| eval writes_ps = 'data.writes_ps' 
| eval interval = 'data.interval' 
| eval op_count = (reads_ps + writes_ps) * interval 
| eval avg_service_ms = 'data.avg_service_ms' 
| eval avg_wait_ms = 'data.avg_total_ms' 
| eval cpu_pct = 'data.cpu_pct' 
| eval network_pct = 'data.network_pct'
| search mount_point = "/opt" 
| timechart minspan=60s partial=f per_second(op_count) as iops, avg(data.cpu_pct) as avg_cpu_pct, avg(data.avg_service_ms) as avg_service_ms, avg(data.avg_total_ms) as avg_wait_ms, avg(data.network_pct) as avg_network_pct
| eval iops = round(iops) 
| eval avg_cpu_pct = round(avg_cpu_pct) 
| eval avg_service_ms = round(avg_service_ms) 
| eval avg_wait_ms = round(avg_wait_ms) 
| eval avg_network_pct = round(avg_network_pct) 
| fields _time, iops avg_wait_ms 
| rename avg_wait_ms as "Wait Time"

 

One way to get "correct" queries is to use your local Splunk instance and its Monitoring Console. Find the dashboard that shows the information you want, then copy its query and modify it as needed before running it on Splunk Cloud. Remember that you can expand macros with the Ctrl/Cmd+E combination!
