We have multiple servers where we constantly see errors such as -
10-04-2018 13:25:55.480 -0500 INFO TcpOutputProc - Removing quarantine from idx=<indexer1>:9997
10-04-2018 13:25:55.480 -0500 INFO TcpOutputProc - Removing quarantine from idx=<indexer2>:9997
10-04-2018 13:25:55.480 -0500 INFO TcpOutputProc - Removing quarantine from idx=<indexer3>:9997
10-04-2018 13:25:55.480 -0500 INFO TcpOutputProc - Removing quarantine from idx=<indexer4>:9997
10-04-2018 13:25:55.489 -0500 ERROR TcpOutputFd - Read error. Connection reset by peer
10-04-2018 13:25:55.491 -0500 ERROR TcpOutputFd - Read error. Connection reset by peer
10-04-2018 13:25:55.493 -0500 ERROR TcpOutputFd - Read error. Connection reset by peer
10-04-2018 13:25:55.501 -0500 ERROR TcpOutputFd - Read error. Connection reset by peer
10-04-2018 13:25:55.509 -0500 ERROR TcpOutputFd - Read error. Connection reset by peer
Telnet to the indexers works fine -
telnet <indexer1> 9997
telnet <indexer2> 9997
telnet <indexer3> 9997
telnet <indexer4> 9997
We see lots of data for -
index=_internal host=<host name>
And nothing for -
index=<customer index> host=<host name>
Hi @ddrillic,
Since you mentioned that connectivity from the UF to the indexers looks good, I would start troubleshooting by checking for congestion on the indexers or their resource usage (they might be too busy because of a resource constraint).
Makes sense @harsmarvania57, but all other data sources look fine.
When searching for a specific path - index=_internal host=<host name> <path>
I see -
10-07-2018 14:04:24.766 -0500 INFO Metrics - group=per_source_thruput, series="<path to log file>", kbps=0.05002527423023678, eps=0.419355447451456, kb=1.55078125, ev=13, avg_age=0.6153846153846154, max_age=3
What does it mean?
This is a metrics.log entry that reports per_source_thruput from the UF to the indexer for each source you are monitoring. For a detailed explanation of the fields in the log message above, please check https://docs.splunk.com/Documentation/Splunk/7.1.2/Troubleshooting/Aboutmetricslog#Thruput_messages
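If it helps to look at the fields one by one, here is a minimal Python sketch that splits the entry above into its key=value pairs. The sample line is copied from the post; the function name `parse_metrics_fields` and the regular expression are my own illustration, not anything Splunk ships. The last line divides `kb` by `kbps`, which should give a rough idea of the length of the sampling interval in seconds, assuming `kbps` is simply `kb` averaged over that interval.

```python
import re

# Sample line copied from the post above; "<path to log file>" is the
# poster's placeholder, not a real path.
LINE = ('10-07-2018 14:04:24.766 -0500 INFO Metrics - group=per_source_thruput, '
        'series="<path to log file>", kbps=0.05002527423023678, '
        'eps=0.419355447451456, kb=1.55078125, ev=13, '
        'avg_age=0.6153846153846154, max_age=3')

# A key=value pair: the value is either double-quoted or runs to the next comma.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_metrics_fields(line):
    """Return the key=value fields of a metrics.log line as a dict of strings."""
    # Everything after the first " - " is the comma-separated field list.
    _, _, payload = line.partition(' - ')
    return {key: value.strip('"') for key, value in PAIR.findall(payload)}


fields = parse_metrics_fields(LINE)
for name, value in fields.items():
    print(f"{name:>8} = {value}")

# Assumption: kbps is kb averaged over the sampling interval.
interval_seconds = float(fields["kb"]) / float(fields["kbps"])
print(f"approximate interval: {interval_seconds:.1f} s")
```

So this entry says that 13 events (about 1.55 KB) were counted for that source during the interval.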
We do see the following warning -
-- 10-12-2018 14:36:41.029 -0500 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group indexers has been blocked for 410303 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
This means that the queues on the indexers may be blocked for various reasons. You need to check the Monitoring Console for the status of blocked queues.
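To put the "blocked for 410303 seconds" figure from that warning into perspective, here is a small Python sketch that converts it to days and hours and estimates when the blocking started. The helper name `blocked_since` is hypothetical, the sample is the warning from the post with the leading "-- " removed and the tail trimmed, and the estimate assumes the counter reflects continuous blocking. The timestamp is treated as forwarder local time (the -0500 offset is ignored).

```python
import re
from datetime import datetime, timedelta

# Warning copied from the post above (leading "-- " removed, tail trimmed).
WARNING = ('10-12-2018 14:36:41.029 -0500 WARN TcpOutputProc - Tcpout Processor: '
           'The TCP output processor has paused the data flow. Forwarding to '
           'output group indexers has been blocked for 410303 seconds.')


def blocked_since(line):
    """Return (blocked duration, estimated local start time) for a warning line."""
    # The first 23 characters hold the timestamp, e.g. "10-12-2018 14:36:41.029".
    logged_at = datetime.strptime(line[:23], "%m-%d-%Y %H:%M:%S.%f")
    seconds = int(re.search(r"blocked for (\d+) seconds", line).group(1))
    duration = timedelta(seconds=seconds)
    return duration, logged_at - duration


duration, started = blocked_since(WARNING)
print("blocked for:", duration)    # 4 days, 17:58:23
print("blocked since:", started)   # 2018-10-07 20:38:18.029000
```

That is almost five days of blocked output, starting on the evening of 10-07, a few hours after the per_source_thruput entry quoted earlier in the thread.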