Monitoring Splunk

IndexingQueue gets full on Indexers and forwarders not able to connect to it

sylim_splunk
Splunk Employee
Splunk Employee

Indexers are getting blocked periodically throughout the day, causing our heavy forwarders to stop forwarding data.

-- Error on HF --
07-25-2022 09:22:55.982 -0400 WARN AutoLoadBalancedConnectionStrategy [5212 TcpOutEloop] - Cooked connection to ip=10.3.13.4:9997 timed out
07-25-2022 09:27:14.858 -0400 WARN TcpOutputFd [5212 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
07-25-2022 10:58:06.973 -0400 WARN TcpOutputFd [5034 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused

Whenever this was found on a HF the indexer with the IP had closed the port 9997 and indexer's queues were full.

 

-- Log messages on the indexer:

07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

07-26-2022 08:56:51.759 -0400 INFO IndexWriter [37662 indexerPipe_1] - The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

 

Labels (1)
Tags (1)
1 Solution

sylim_splunk
Splunk Employee
Splunk Employee

The issue started from the splunk-optimize process unable to access the tsidx files for optimization and eventually gave up data feeding and wait until optimizer catch up the backlogs - the log messages like, 

-- splunkd.log

The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

-- Before this log there are a lot of error logs that complain about the access permissions(errno=1) for tsidx,

ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

 

The graph for 'index=_internal "see rc for more details" | timechart span=1m count ' will match the timing of queue blockage issue, which means that's the cause of the symptoms.

-- errno=1 (EPERM) why this happens?

The customer confirmed that the indexers run antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded but it's found to have missed the sub directories. Due to this and the scanner algorithms the access to the tsidx files by Splunk-optimize is considered to be suspicious and gets blocked, which eventually paused the data flow, the port 9997 also gets closed.

 

View solution in original post

sylim_splunk
Splunk Employee
Splunk Employee

The issue started from the splunk-optimize process unable to access the tsidx files for optimization and eventually gave up data feeding and wait until optimizer catch up the backlogs - the log messages like, 

-- splunkd.log

The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

-- Before this log there are a lot of error logs that complain about the access permissions(errno=1) for tsidx,

ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1

 

The graph for 'index=_internal "see rc for more details" | timechart span=1m count ' will match the timing of queue blockage issue, which means that's the cause of the symptoms.

-- errno=1 (EPERM) why this happens?

The customer confirmed that the indexers run antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded but it's found to have missed the sub directories. Due to this and the scanner algorithms the access to the tsidx files by Splunk-optimize is considered to be suspicious and gets blocked, which eventually paused the data flow, the port 9997 also gets closed.

 

Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

[Puzzles] Solve, Learn, Repeat: Matching cron expressions

This puzzle (first published here) is based on matching timestamps to cron expressions.All the timestamps ...

Design, Compete, Win: Submit Your Best Splunk Dashboards for a .conf26 Pass

Hello Splunkers,  We’re excited to kick off a Splunk Dashboard contest! We know that dashboards are a primary ...

May 2026 Splunk Expert Sessions: Security & Observability

Level Up Your Operations: May 2026 Splunk Expert Sessions Whether you are refining your security posture or ...