Indexing queue fills up on indexers and forwarders are unable to connect to them

sylim_splunk
Splunk Employee

Indexers are getting blocked periodically throughout the day, causing our heavy forwarders to stop forwarding data.

-- Error on HF --
07-25-2022 09:22:55.982 -0400 WARN AutoLoadBalancedConnectionStrategy [5212 TcpOutEloop] - Cooked connection to ip=10.3.13.4:9997 timed out
07-25-2022 09:27:14.858 -0400 WARN TcpOutputFd [5212 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
07-25-2022 10:58:06.973 -0400 WARN TcpOutputFd [5034 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused

Whenever this appeared on a HF, the indexer at that IP had closed port 9997 and its queues were full.
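
To confirm the blockage from the indexer side, a search along these lines (a minimal sketch; the queue name and span may need adjusting for your environment) shows when the index queue reports blocked=true in metrics.log:

index=_internal source=*metrics.log* group=queue name=indexqueue blocked=true
| timechart span=1m count by host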

 

-- Log messages on the indexer:

07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

07-26-2022 08:56:51.759 -0400 INFO IndexWriter [37662 indexerPipe_1] - The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
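
A quick way to see how far behind splunk-optimize has fallen is to count the unmerged .tsidx files in the hot bucket named in the log (path taken from the message above; a healthy hot bucket normally holds only a handful):

ls /opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70/*.tsidx | wc -l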

 

1 Solution

sylim_splunk
Splunk Employee

The issue started with the splunk-optimize process being unable to access the tsidx files for optimization; eventually the index processor paused data ingestion and waited for the optimizer to catch up on the backlog. The log messages look like this:

-- splunkd.log

The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

-- Before this log there are many error messages complaining about access permissions (errno=1) on the tsidx files:

ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

 

The graph for 'index=_internal "see rc for more details" | timechart span=1m count' lines up with the timing of the queue blockages, which means the failed optimize runs are the cause of the symptoms.
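
To make the correlation visible in one chart, a variant such as this (a sketch; it relies only on the stock _internal fields shown above) overlays the optimize failures and the index-queue blockage:

index=_internal ("see rc for more details" OR (source=*metrics.log* group=queue name=indexqueue blocked=true))
| eval type=if(searchmatch("\"see rc for more details\""), "optimize_failed", "indexqueue_blocked")
| timechart span=1m count by type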

-- errno=1 (EPERM): why does this happen?

The customer confirmed that the indexers run an antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded, but the exclusions turned out to miss the subdirectories. Because of that gap, the scanner's heuristics treated splunk-optimize's access to the tsidx files as suspicious and blocked it, which eventually paused the data flow and closed port 9997 as well. Make sure AV exclusions apply recursively to everything under $SPLUNK_HOME and $SPLUNK_DB.
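
To verify the EPERM at the OS level before changing AV policy, one option (a sketch; the PID lookup, paths, and exact splunk-optimize flags are assumptions to check against your version) is to trace file access, or to rerun the optimizer by hand against the affected bucket:

# Watch splunkd's child processes for permission-denied file opens
strace -f -e trace=openat -p $(pgrep -o splunkd) 2>&1 | grep EPERM

# Rerun the optimizer manually on the bucket from the error message
$SPLUNK_HOME/bin/splunk-optimize -d /opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70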

 
