
Indexing queue fills up on indexers and forwarders are unable to connect

sylim_splunk
Splunk Employee

Indexers are getting blocked periodically throughout the day, causing our heavy forwarders to stop forwarding data.

-- Error messages on the HF:
07-25-2022 09:22:55.982 -0400 WARN AutoLoadBalancedConnectionStrategy [5212 TcpOutEloop] - Cooked connection to ip=10.3.13.4:9997 timed out
07-25-2022 09:27:14.858 -0400 WARN TcpOutputFd [5212 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
07-25-2022 10:58:06.973 -0400 WARN TcpOutputFd [5034 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused

Whenever this appeared on a HF, the indexer at that IP had closed port 9997 and its queues were full.
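To gauge how widespread this is, the warnings can be charted per forwarder from the _internal index. A sketch, assuming the HFs forward their own _internal logs (delivery of those events may itself lag while the indexers are blocked):

index=_internal sourcetype=splunkd component=TcpOutputFd log_level=WARN "Connection refused"
| timechart span=5m count by host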

 

-- Log messages on the indexer:

07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

07-26-2022 08:56:51.759 -0400 INFO IndexWriter [37662 indexerPipe_1] - The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
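The blockage also shows up in the indexers' metrics.log as queue snapshots carrying blocked=true while a queue is saturated. A sketch for charting it (name=indexqueue is the stock indexing queue):

index=_internal source=*metrics.log* group=queue name=indexqueue blocked=true
| timechart span=1m count by host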

 

Solution

sylim_splunk
Splunk Employee

The issue started with the splunk-optimize process being unable to access the tsidx files for optimization; the indexer eventually paused data ingestion and waited for the optimizer to catch up on the backlog. The log messages look like this:

-- splunkd.log

The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
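These pause events can be charted per index to see which indexes are stalling. A sketch; the idx field is auto-extracted from the message text:

index=_internal sourcetype=splunkd component=IndexWriter "paused data flow"
| timechart span=5m count by idx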

-- Before the pause message above there are many error logs complaining about access permissions (errno=1) on the tsidx files:

ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

 

The graph for the search below matches the timing of the queue blockage, which points to the failed optimize runs as the cause of the symptoms:

index=_internal "see rc for more details" | timechart span=1m count
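To line both symptoms up on a single chart, the optimize failures can be overlaid with the blocked-queue snapshots from metrics.log. A sketch; the series names are illustrative:

index=_internal ((component=SplunkOptimize "optimize finished: failed") OR (source=*metrics.log* group=queue name=indexqueue blocked=true))
| eval series=if(component=="SplunkOptimize", "optimize_failed", "indexqueue_blocked")
| timechart span=1m count by series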

-- errno=1 (EPERM): why does this happen?

The customer confirmed that the indexers run an antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded, but the exclusions turned out to miss the subdirectories. Because of this, the scanner's heuristics treated splunk-optimize's access to the tsidx files as suspicious and blocked it, which eventually paused the data flow and closed port 9997.
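Once the exclusions are corrected, it is worth alerting on any recurrence. A minimal sketch, assuming the automatic key=value extraction on splunkd logs picks up errno and dir (it normally does):

index=_internal sourcetype=splunkd component=SplunkOptimize log_level=ERROR "optimize finished: failed" errno=1
| stats count by host, dir

Saved as an alert that fires whenever results are returned, this catches the permission failures before the queues fill up again.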

 

