Indexers are getting blocked periodically throughout the day, causing our heavy forwarders (HFs) to stop forwarding data.
-- Error on HF --
07-25-2022 09:22:55.982 -0400 WARN AutoLoadBalancedConnectionStrategy [5212 TcpOutEloop] - Cooked connection to ip=10.3.13.4:9997 timed out
07-25-2022 09:27:14.858 -0400 WARN TcpOutputFd [5212 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
07-25-2022 10:58:06.973 -0400 WARN TcpOutputFd [5034 TcpOutEloop] - Connect to 10.3.13.4:9997 failed. Connection refused
Whenever these messages appeared on an HF, the indexer at that IP had closed port 9997 and its queues were full.
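A quick way to confirm the symptom from the forwarder side is to test whether the indexer's receiving port is accepting TCP connections. This is a minimal sketch using Python's standard library; the host/port values are taken from the warnings above and should be adjusted for your environment.

```python
import socket

# Sketch: check whether an indexer's receiving port (9997 in the
# warnings above) is currently accepting TCP connections.
def port_open(host, port=9997, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both "Connection refused" and connect timeouts,
        # matching the two WARN messages seen on the HF.
        return False

# Example (IP from the warnings above; no expected output claimed):
print(port_open("10.3.13.4"))
```

When the port check fails while the indexer process is still running, look at the indexer-side logs next, as below.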
-- Log messages on the indexer:
07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1
07-26-2022 08:56:51.759 -0400 INFO IndexWriter [37662 indexerPipe_1] - The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
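When triaging, it helps to pull the bucket path, return code, and errno out of the SplunkOptimize failure lines programmatically. A minimal sketch, assuming the message format shown in the sample above (the regex and field names are illustrative, not a Splunk API):

```python
import re

# Sketch: extract bucket dir, rc, and errno from a SplunkOptimize
# failure line in splunkd.log (pattern based on the samples above).
FAIL_RE = re.compile(
    r"ERROR SplunkOptimize .*dir=(?P<dir>\S+), rc=(?P<rc>-?\d+).*errno=(?P<errno>\d+)"
)

def parse_optimize_failure(line):
    m = FAIL_RE.search(line)
    if not m:
        return None
    return {
        "dir": m.group("dir"),
        "rc": int(m.group("rc")),
        "errno": int(m.group("errno")),
    }

sample = (
    "07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize [37311 MainThread] - "
    "(child_304072__SplunkOptimize) optimize finished: failed, see rc for more "
    "details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, "
    "rc=-4 (unsigned 252), errno=1"
)
print(parse_optimize_failure(sample))
# → {'dir': '/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70', 'rc': -4, 'errno': 1}
```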
The issue starts with the splunk-optimize process being unable to access the tsidx files for optimization; the indexer eventually pauses data ingestion and waits for the optimizer to catch up on the backlog. The log messages look like:
-- splunkd.log
The index processor has paused data flow. Too many tsidx files in idx=_metrics bucket="/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_804" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
-- Before this log there are many error messages complaining about access permissions (errno=1) on the tsidx files:
ERROR SplunkOptimize [37311 MainThread] - (child_304072__SplunkOptimize) optimize finished: failed, see rc for more details, dir=/opt/splunk/var/lib/splunk/_metrics/db/hot_v1_70, rc=-4 (unsigned 252), errno=1
A timechart of these failures, e.g. 'index=_internal "see rc for more details" | timechart span=1m count', will match the timing of the queue blockages, confirming the optimize failures as the cause of the symptoms.
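If the `_internal` index itself is blocked, the same per-minute correlation can be done offline against a copied splunkd.log. A sketch mirroring the `timechart span=1m count` search above (the timestamp layout is assumed from the log samples):

```python
from collections import Counter

# Sketch: bucket SplunkOptimize failures per minute from raw
# splunkd.log lines, mirroring `timechart span=1m count`.
def failures_per_minute(lines):
    counts = Counter()
    for line in lines:
        if "see rc for more details" in line:
            date, time_ = line.split()[:2]       # e.g. "07-26-2022", "08:51:28.857"
            counts[f"{date} {time_[:5]}"] += 1   # truncate to HH:MM
    return counts

sample = [
    "07-26-2022 08:51:28.857 -0400 ERROR SplunkOptimize ... see rc for more details, errno=1",
    "07-26-2022 08:51:29.120 -0400 ERROR SplunkOptimize ... see rc for more details, errno=1",
]
print(failures_per_minute(sample))
# → Counter({'07-26-2022 08:51': 2})
```

Spikes in this count should line up with the queue-blocked and port-closed windows seen on the HFs.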
-- Why does errno=1 (EPERM) happen?
The customer confirmed that the indexers run an antivirus scanner with $SPLUNK_HOME and $SPLUNK_DB excluded, but the exclusions were found to have missed the subdirectories. Because of this, the scanner's heuristics flagged splunk-optimize's access to the tsidx files as suspicious and blocked it, which eventually paused the data flow and closed port 9997.
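Since the "Too many tsidx files" pause fires when a hot bucket accumulates more tsidx files than splunk-optimize can merge, counting tsidx files per hot bucket on the indexer is a quick way to spot the backlog and verify that fixing the antivirus exclusions lets the count drain. A sketch; the path pattern is taken from the log samples above and should be adjusted for your $SPLUNK_DB:

```python
import glob
import os

# Sketch: count .tsidx files in each hot bucket of an index, to spot
# the merge backlog behind the "Too many tsidx files" pause. The
# default path comes from the log samples above.
def tsidx_counts(index_db="/opt/splunk/var/lib/splunk/_metrics/db"):
    counts = {}
    for bucket in glob.glob(os.path.join(index_db, "hot_v1_*")):
        counts[os.path.basename(bucket)] = len(
            glob.glob(os.path.join(bucket, "*.tsidx"))
        )
    return counts

# Example usage on an indexer:
for bucket, n in sorted(tsidx_counts().items()):
    print(bucket, n)
```

A steadily growing count in one hot bucket, alongside the errno=1 optimize failures, points back at whatever is blocking splunk-optimize (here, the antivirus scanner).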