We have a distributed architecture:
Search head cluster with 6 hosts across 3 data centres
Indexer cluster with 6 index peers and 1 index master
Forwarders on all servers in the environment: web tier, app tier, load balancer tier
A few months back, the web tier stopped sending: its logs stopped arriving in Splunk, while the other tiers are still working.
When we checked the activity on the web tier, we found that patching had taken place and splunkd had been restarted. Forwarding from the web tier stopped after that.
The splunkd process came back up fine, though, and is still running on those hosts.
We observed that the WARN messages below started appearing at exactly the same time.
[ Note the blocked duration: it starts at 10 seconds and keeps growing ]
WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index_peers has been blocked for 10 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
----------
------
+0000 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index-peers has been blocked for 9725460 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
+0000 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index-peers has been blocked for 9725470 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
=============================================================================
We suspect this WARN message points to the cause because the same thing happened on another tier recently.
The load balancer tier stopped forwarding recently, and the WARN above started appearing from that moment onwards, beginning with "blocked for 10 seconds".
What could have caused forwarding to stop so suddenly?
Version: 7.2.1, Build: be11b2c46e23
This sounds more like a network connectivity problem...
Are you able to test port 9997 on the affected forwarders to ensure it's open?
Could it be that these affected servers are on a different VLAN?
Are the indexers listed in your outputs.conf listed as FQDNs or IPs? Can you run the nslookup command on those indexers from your forwarders?
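If telnet or nslookup are not installed on the forwarders, the same two checks (name resolution and a TCP connect to the receiving port) can be sketched in a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the host names and port are whatever you pass in, not values taken from this environment.

```python
import socket


def resolve(host):
    """Return the sorted IP addresses a host name resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def check_indexers(hosts, port):
    """Return one (host, addresses, reachable) tuple per indexer."""
    return [(host, resolve(host), port_open(host, port)) for host in hosts]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_indexers.py <port> <host> [<host> ...]
    if len(sys.argv) > 2:
        for host, addresses, reachable in check_indexers(sys.argv[2:], int(sys.argv[1])):
            print(f"{host}: resolves to {addresses or 'nothing'}, port open: {reachable}")
```

Run it from an affected forwarder and from a working one, against the same indexers and the same receiving port, and compare the results.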
We couldn't observe any connectivity issue as such. As I mentioned, only a few forwarders have stopped working. The rest of the forwarders are still sending data to the same indexer cluster.
And the port we use is 8089, which is open:
[splunkuser@fwdernode system]$ telnet idxnode1.iuser.iroot.adidom.com 8089
Trying xx.yy.zz.aa...
Connected to idxnode1.iuser.iroot.adidom.com.
Escape character is '^]'.
outputs.conf
[tcpout]
defaultGroup = index_peers
Indexers listed as index_peers
server.conf on indexer node 1
[general]
serverName = node1_idx01
server.conf on indexer node 2
[general]
serverName = node2_idx01
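The outputs.conf excerpt above shows only the default group, so it may help to list what that group actually points at: either a static server list or an indexer discovery stanza. Below is a rough sketch that reads an outputs.conf with Python's configparser. The sample stanza contents (host names, the discovery name cluster1) are hypothetical and only there to make the example runnable. The setting names server, indexerDiscovery and master_uri are the ones from the outputs.conf spec.

```python
import configparser

# Hypothetical sample, NOT the poster's real configuration.
SAMPLE = """
[tcpout]
defaultGroup = index_peers

[indexer_discovery:cluster1]
master_uri = https://master.example.com:8089

[tcpout:index_peers]
indexerDiscovery = cluster1
"""


def forwarding_targets(conf_text):
    """Map each default tcpout group to its static servers and discovery master URI."""
    parser = configparser.ConfigParser(
        delimiters=("=",), interpolation=None, strict=False
    )
    parser.optionxform = str  # keep setting names case sensitive
    parser.read_string(conf_text)

    default = parser.get("tcpout", "defaultGroup", fallback="")
    groups = [g.strip() for g in default.split(",") if g.strip()]

    result = {}
    for group in groups:
        stanza = f"tcpout:{group}"
        raw_servers = parser.get(stanza, "server", fallback="")
        servers = [s.strip() for s in raw_servers.split(",") if s.strip()]
        master_uri = None
        discovery = parser.get(stanza, "indexerDiscovery", fallback=None)
        if discovery:
            # Look up the matching [indexer_discovery:<name>] stanza.
            master_uri = parser.get(
                f"indexer_discovery:{discovery}", "master_uri", fallback=None
            )
        result[group] = {"servers": servers, "master_uri": master_uri}
    return result


if __name__ == "__main__":
    for group, target in forwarding_targets(SAMPLE).items():
        print(group, target)
```

This reads a single file only and does not reproduce Splunk's configuration layering across apps, so treat the result as a quick look rather than the effective configuration.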
Thank you, that's helpful.
Can you read splunkd.log while you restart the splunkforwarder service on the affected machines? What does it say before it starts saying "Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index-peers has been blocked"?
I restarted splunkd on one of the affected forwarders. Here is the flow of splunkd.log:
splunkd.log
Just before shutdown
03-28-2022 11:24:23.074 +0100 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index_peers has been blocked for 9989240 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
03-28-2022 11:24:33.085 +0100 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index_peers has been blocked for 9989250 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
03-28-2022 11:24:37.798 +0100 INFO PipelineComponent - Performing early shutdown tasks
03-28-2022 11:24:37.798 +0100 INFO IndexProcessor - handleSignal : Disabling streaming searches.
03-28-2022 11:24:37.798 +0100 INFO IndexProcessor - request state change from=RUN to=SHUTDOWN_SIGNALED
03-28-2022 11:24:37.798 +0100 INFO loader - Shutdown HTTPDispatchThread
03-28-2022 11:24:37.798 +0100 INFO ShutdownHandler - Shutting down splunkd
Starting splunkd
03-28-2022 11:25:40.952 +0100 INFO loader - Splunkd starting (build be11b2c46e23).
03-28-2022 11:25:46.164 +0100 INFO TcpOutputProc - Will resolve indexer names at 450.000 second interval.
03-28-2022 11:25:46.589 +0100 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
03-28-2022 11:25:46.614 +0100 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
03-28-2022 11:26:11.164 +0100 INFO TcpOutputProc - Initialization time for indexer discovery service for default group=index_peers has been completed.
03-28-2022 11:26:11.212 +0100 INFO ScheduledViewsReaper - Scheduled views reaper run complete. Reaped count=0 scheduled views
03-28-2022 11:26:11.215 +0100 INFO TailReader - Continuing...
03-28-2022 11:26:11.215 +0100 INFO TailReader - ...continuing.
The TcpOutputProc warnings start appearing again, with the counter reset to 10 seconds
03-28-2022 11:26:11.411 +0100 WARN TcpOutputProc - Applying quarantine to ip=xx.yy.zz.aa port=xxxx _numberOfFailures=2
03-28-2022 11:26:21.036 +0100 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index_peers has been blocked for 10 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
03-28-2022 11:26:21.218 +0100 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
03-28-2022 11:26:21.218 +0100 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
03-28-2022 11:26:31.047 +0100 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group index_peers has been blocked for 20 seconds. This will probably stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
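Since each warning carries both a timestamp and a "blocked for N seconds" counter, subtracting one from the other gives the moment the output group became blocked, which can then be compared with the patching date. Here is a small sketch of that arithmetic. The regular expression is an assumption based only on the lines pasted above, and the function name is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Pattern derived from the WARN lines quoted in this thread (an assumption,
# not an official log format definition).
BLOCKED = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\.\d{3} (?P<tz>[+-]\d{4}) "
    r"WARN\s+TcpOutputProc - .*?output group (?P<group>\S+) "
    r"has been blocked for (?P<secs>\d+) seconds"
)


def blocked_since(line):
    """Return (group, start of the block) for a 'paused the data flow' WARN line."""
    match = BLOCKED.search(line)
    if match is None:
        return None
    logged_at = datetime.strptime(
        f"{match['ts']} {match['tz']}", "%m-%d-%Y %H:%M:%S %z"
    )
    # The result keeps the UTC offset of the log line itself.
    return match["group"], logged_at - timedelta(seconds=int(match["secs"]))


if __name__ == "__main__":
    line = (
        "03-28-2022 11:24:23.074 +0100 WARN TcpOutputProc - Tcpout Processor: "
        "The TCP output processor has paused the data flow. Forwarding to output "
        "group index_peers has been blocked for 9989240 seconds."
    )
    group, since = blocked_since(line)
    print(group, since.isoformat())
```

For the last warning before the shutdown (11:24:23, blocked for 9989240 seconds) this works out to 2 December 2021, 20:37:03 at the +0100 offset of the log line.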
This has me stumped. 🤔
@yj055 wrote: The TcpOutputProc warnings start appearing again, with the counter reset to 10 seconds
03-28-2022 11:26:11.411 +0100 WARN TcpOutputProc - Applying quarantine to ip=xx.yy.zz.aa port=xxxx _numberOfFailures=2
Was this one of the correct IP/port combinations for your indexer or are you using indexer discovery?
ip=xx.yy.zz.aa port=xxxx
Maybe you can try to completely uninstall the forwarder and reinstall.
Yes, those are the correct IP and port of the indexer.
The log message is there for every index peer, each with the correct receiver port (that is, the log line below occurs multiple times, once for each indexer IP and port):
03-28-2022 11:26:11.411 +0100 WARN TcpOutputProc - Applying quarantine to ip=xx.yy.zz.aa port=xxxx _numberOfFailures=2
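To confirm that every index peer really is being quarantined, and not just a subset, the quarantine lines can be tallied per IP and port. The sketch below assumes the line format shown above. The function name and the example addresses in the usage note are hypothetical.

```python
import re
from collections import Counter

# Matches the "Applying quarantine" WARN lines as quoted in this thread.
QUARANTINE = re.compile(
    r"WARN\s+TcpOutputProc - Applying quarantine to "
    r"ip=(?P<ip>\S+) port=(?P<port>\S+) _numberOfFailures=(?P<failures>\d+)"
)


def quarantined_targets(lines):
    """Count quarantine warnings per (ip, port) pair."""
    counts = Counter()
    for line in lines:
        match = QUARANTINE.search(line)
        if match:
            counts[(match["ip"], match["port"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python quarantine_count.py /path/to/splunkd.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for (ip, port), count in quarantined_targets(handle).most_common():
                print(f"{ip}:{port} quarantined {count} times")
```

On a forwarder, splunkd.log normally lives under $SPLUNK_HOME/var/log/splunk/. If the list covers all six peers on the receiving port, the problem is unlikely to be a single unhealthy indexer.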
As I mentioned in the initial post, the entire set of web tier forwarders stopped working a few months back.
The load balancer tier forwarders kept working until last week. Then the LB forwarders stopped, and splunkd.log started showing the same TcpOutputProc WARN message: "Tcpout Processor: The TCP output processor has paused the data flow".
I see the same messages in the console health notifications, as well as a "TCPOutAutoLB-0" warning.
We would like to know what caused all these production forwarders to stop suddenly before we decide on and plan a fresh install.
More may well stop. As of now the application tier forwarders are still working, and those source types are still returning data.