
Indexing queue gets full on indexers and forwarders are not able to connect to them

sylim_splunk
Splunk Employee

Indexers are getting blocked periodically throughout the day, causing our heavy forwarders to stop forwarding data.

-- Error on HF --
07-25-2022 09:22:55.982 -0400 WARN AutoLoadBalancedConnectionStrategy [5212 TcpOutEloop] - Cooked connection to ip=10.3.13.4:9997 timed out
07-25-2022 09:27:14.858 -0400 WARN TcpOutputFd [5212 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
07-25-2022 10:58:06.973 -0400 WARN TcpOutputFd [5034 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused

Whenever this appeared on an HF, the indexer at that IP had closed port 9997 and its queues were full.
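For reference, queue blockage on the indexers can be confirmed from metrics.log; a minimal sketch using the standard group=queue instrumentation (name=indexqueue, blocked=true):

index=_internal source=*metrics.log* group=queue name=indexqueue blocked=true
| timechart span=1m count by host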

 

-- Log messages on the indexer:

07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

07-26-2022 08:56:51.759 -0400 INFO IndexWriter [37662 indexerPipe_1] - The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

 

1 Solution

sylim_splunk
Splunk Employee

The issue started with the splunk-optimize process being unable to access the tsidx files for optimization; the index processor eventually paused data ingestion and waited for the optimizer to catch up on the backlog. The log messages look like this:

-- splunkd.log

The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

-- Before this log there are many error messages complaining about access permissions (errno=1) on the tsidx files:

ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

 

The timechart from 'index=_internal "see rc for more details" | timechart span=1m count' matches the timing of the queue blockage, which points to the optimize failures as the cause of the symptoms.
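To line the two up, the index queue fill level from metrics.log can be charted over the same window; a minimal sketch (current_size_kb and max_size_kb are standard group=queue fields):

index=_internal source=*metrics.log* group=queue name=indexqueue
| eval fill_pct=round(current_size_kb/max_size_kb*100, 2)
| timechart span=1m max(fill_pct) by host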

-- Why does errno=1 (EPERM) happen?

The customer confirmed that the indexers run an antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded, but the exclusions turned out to miss the subdirectories. Because of that, the scanner's heuristics treated splunk-optimize's access to the tsidx files as suspicious and blocked it, which eventually paused the data flow and closed port 9997 as well.
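Once the antivirus exclusions are corrected to cover $SPLUNK_DB recursively, the optimize failures should stop; a sketch built on the error text above to verify this per indexer:

index=_internal sourcetype=splunkd component=SplunkOptimize "optimize finished: failed" errno=1
| timechart span=5m count by host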

 

