Monitoring Splunk

Processing queues blocking when thruput/event volume decreases

cwhelan
Explorer

Hi guys,

I am currently seeing that processing queues on one of my heavy forwarders appear to be blocking during a 6-hour period at night, when the log volume being ingested is much lower (during this period, ingested volume drops from 10 million to under 3 million events).

Are there any obvious reasons why queue block ratios would increase at the same time that thruput/event volume decreases? I'm guessing the opposite would generally be expected.

We can see that block ratios increase at 1:00 AM below:

index=_internal source=*metrics.log group=queue blocked=true host=HF max_size_kb>0
| timechart span=30m@m count by name

[Screenshot: blocked queue counts by queue name in 30-minute buckets, increasing from 1:00 AM]
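
To narrow down which queue saturates first, I can also chart the fill percentage rather than just the blocked count. This is only a sketch against the same metrics.log data; current_size_kb and max_size_kb are the queue-size fields I'd expect to be there:

index=_internal source=*metrics.log host=HF group=queue max_size_kb>0
| eval fill_pct=round(100*current_size_kb/max_size_kb,1)
| timechart span=30m@m max(fill_pct) by name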

 

At the same time we can see thruput decrease below:

index=_internal sourcetype=splunkd host=HF group=per_sourcetype_thruput series=*
| timechart sum(kb) by series

[Screenshots: per-sourcetype thruput (sum of KB by series), showing the volume drop over the same period]
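
To check whether the drop is in the forwarder's overall output rather than a single noisy sourcetype, something like the following should also work (a sketch; per_host_thruput is another metrics.log group on the same host, so the data should already be in _internal):

index=_internal sourcetype=splunkd host=HF group=per_host_thruput
| timechart span=30m@m sum(kb) as total_kb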

 

Current queue config and context over last 24h:

[Screenshots: current queue configuration and queue context over the last 24 hours]

 

I have also noticed a lot of 'Could not send data to output queue (parsingQueue), retrying' messages around the same time:

index=_internal host=HF source=*splunkd* (log_level=ERROR OR log_level=WARN) event_message="Could not send data to output queue (parsingQueue), retrying..."
| timechart span=5m@m count by event_message

[Screenshot: count of 'Could not send data to output queue (parsingQueue), retrying...' messages in 5-minute buckets]
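
Since a blocked parsingQueue is usually just back-pressure from queues further down the pipeline, I also want to see which queue starts blocking first each night. This is a rough sketch against the same metrics.log data, not a definitive diagnostic:

index=_internal source=*metrics.log host=HF group=queue blocked=true earliest=-24h
| stats min(_time) as first_blocked count by name
| eval first_blocked=strftime(first_blocked, "%Y-%m-%d %H:%M:%S")
| sort first_blocked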


I would appreciate any insight into why queue block ratios would increase at the same time that thruput/event volume decreases, as well as any suggestions for getting the average queue block ratios as close to 0 as possible. Queues currently appear to be blocking throughout the day, with the highest block ratios occurring between 1:00 and 8:00 AM.
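
For reference, the average block ratio per queue that I'm referring to can be approximated like this (a sketch that treats every metrics.log queue sample without blocked=true as unblocked):

index=_internal source=*metrics.log host=HF group=queue max_size_kb>0
| eval is_blocked=if(coalesce(blocked,"false")=="true",1,0)
| timechart span=30m@m avg(is_blocked) by name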


isoutamo
SplunkTrust

Hi

I would start by looking at what is happening on the indexer side. Are there any heavy backups or other activities causing heavy IOPS at the OS level or on your storage system?
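
If you can search _internal from your Splunk Cloud indexers, something like this would show whether their queues block at the same time (just a sketch; the idx-* host pattern is an assumption and depends on your stack's naming):

index=_internal source=*metrics.log group=queue blocked=true host=idx-* ```idx-* is an assumption, adjust to your indexer host names```
| timechart span=30m@m count by host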

r. Ismo

cwhelan
Explorer

Hi isoutamo,

Thanks a lot for the reply. Currently I only have back-end access to our HFs and the front-end console, as everything else is in Splunk Cloud.

 

Would you suggest I open a Splunk support case? Or is there anything else I can do to try to remediate this beforehand?

 

Cheers,

C


isoutamo
SplunkTrust

Unfortunately that information isn't available in the CMC. You could still check it to see whether there is additional load at that time or any other suspicious events.

I'm not sure whether you can access the _introspection index, or whether all the needed data is there, but you could try something like this to see IOPS and wait times for your Splunk Cloud instance:

 

index=_introspection sourcetype=splunk_resource_usage component=IOStats host="idx-*.<your stack name here>.splunkcloud.com"
| eval mount_point = 'data.mount_point' 
| eval reads_ps = 'data.reads_ps' 
| eval writes_ps = 'data.writes_ps' 
| eval interval = 'data.interval' 
| eval op_count = (reads_ps + writes_ps) * interval 
| eval avg_service_ms = 'data.avg_service_ms' 
| eval avg_wait_ms = 'data.avg_total_ms' 
| eval cpu_pct = 'data.cpu_pct' 
| eval network_pct = 'data.network_pct'
| search mount_point = "/opt" 
| timechart minspan=60s partial=f per_second(op_count) as iops, avg(data.cpu_pct) as avg_cpu_pct, avg(data.avg_service_ms) as avg_service_ms, avg(data.avg_total_ms) as avg_wait_ms, avg(data.network_pct) as avg_network_pct
| eval iops = round(iops) 
| eval avg_cpu_pct = round(avg_cpu_pct) 
| eval avg_service_ms = round(avg_service_ms) 
| eval avg_wait_ms = round(avg_wait_ms) 
| eval avg_network_pct = round(avg_network_pct) 
| fields _time, iops avg_wait_ms 
| rename avg_wait_ms as "Wait Time"

 

One way to get "correct" queries is to use your local Splunk instance and its Monitoring Console. Find the dashboard that shows the information you want, then copy its query and modify it as needed before running it on Splunk Cloud. Remember that you can expand macros with the Ctrl/Cmd+E combination!
