I have an environment with about 200 machines, all Windows servers. Every server sends its data over TCP port 9997 directly to my Heavy Forwarder, and all of it goes into the "Windows" index.
What happens is that about once or twice a day, the logs sent by the Universal Forwarders stop arriving from all machines, leaving the Windows index empty. All other data that does not come in through TCP 9997 is normal, such as some scripts that collect other types of information and save it to other indexes.
The problem is only solved when Splunk is restarted on the Heavy Forwarder.
Trying to diagnose the problem, the only thing I could find is this message on all servers with the Universal Forwarder installed:
02-16-2022 15:20:51.293 -0400 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group default-autolb-group has been blocked for 82200 seconds
Has anyone gone through something similar, or can help me try to identify what is happening?
Note that the log on the Heavy Forwarder doesn't show me anything relevant.
Thanks in advance!
You clearly have blocked queues, at least on the HF side and possibly on the indexer side too. An easy way to see the situation on the HF is to add it to the Monitoring Console (MC) as an indexer, with a custom group such as IHF defined. Then you can easily see what is happening with the queues and pipelines on that node (and the others). If you don't have the MC in place yet, I strongly recommend setting it up.
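If you want a quick look before the MC is in place, a search along these lines against the HF's own metrics.log (the host filter is a placeholder for your HF's hostname) shows which queues are reporting themselves as blocked:

    index=_internal host=<your_hf> source=*metrics.log* group=queue blocked=true
    | stats count by host, name

If queues such as indexqueue or typingqueue on the HF show up here every time forwarding stops, the bottleneck is on the HF or downstream of it rather than on the UFs.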
Here are two excellent .conf presentations on how to investigate the situation even without the MC.
Use the DMC to see what's going on with the HF. The UF logs suggest that the HF (as defined in outputs.conf under the default-autolb-group stanza) is down or unavailable, causing data ingestion to stop. Use the "Indexing Performance" dashboards in the DMC to see if any queues are filling up.
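For reference, the default-autolb-group mentioned in the UF warning normally corresponds to a stanza like this in outputs.conf on the forwarders (the hostname below is a placeholder; adjust it to your HF):

    [tcpout]
    defaultGroup = default-autolb-group

    [tcpout:default-autolb-group]
    server = your-hf.example.com:9997

If that target stops accepting data, the UF holds events in its output queue and eventually logs the "data flow has been paused / blocked" warning you are seeing.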
A couple of possible reasons for this issue:
Check whether any queues are filling up on the UF side, due to some sources sending too much data at once.
Also check for network issues between the UF and the HF; look in splunkd.log for timeout errors, and check from the HF side as well.
Also check splunkd.log for any ERROR or WARN messages (see the example search after this list).
When we faced the same issue, it turned out to be caused by intermittent network problems.
In your case it might be the same issue or a new one.
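For example, a search along these lines (adjust the host filters to your UFs and HF; the component names are the usual forwarding-related ones) pulls the relevant warnings and errors from splunkd.log on both sides, so you can see whether timeouts or connection resets line up with the outages:

    index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN)
      (component=TcpOutputProc OR component=TcpInputProc OR component=AutoLoadBalancedConnectionStrategy)
    | stats count by host, component, log_level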
Some additional information:
It seems that sometimes, when a scheduled job starts, your forwarding stops.
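One way to test that theory is to chart blocked-queue events over time and see whether they repeat on the job's schedule; a sketch like this (assuming the HF and UFs all forward their _internal data to the same place you search) should show a regular pattern if a scheduled job is the trigger:

    index=_internal source=*metrics.log* group=queue blocked=true
    | timechart span=5m count by host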