Why does universal forwarder stop sending data?

JuanAntunes · ‎02-17-2022

Hello!

I have an environment with about 200 machines, all Windows Servers. All of them send data over TCP port 9997 directly to my Heavy Forwarder, and everything is stored in the "Windows" index.

About once or twice a day, the logs sent by the Universal Forwarders stop arriving from all machines, leaving the Windows index empty. All other data that does not arrive through TCP 9997 is unaffected, such as some scripts that collect other types of information and write to other indexes.

The problem is only resolved when Splunk is restarted on the Heavy Forwarder.

Trying to diagnose the problem, the only thing I could find is this message on all servers with the Universal Forwarder installed:

02-16-2022 15:20:51.293 -0400 WARN TcpOutputProc - Tcpout Processor: The TCP output processor has paused the data flow. Forwarding to output group default-autolb-group has been blocked for 82200 seconds
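
To put the number in that warning into perspective, here is a minimal Python sketch that pulls the output group and the blocked duration out of the line and converts the seconds to hours and minutes. The regular expression and the helper names `parse_blocked` and `as_hours_minutes` are assumptions made for this illustration; they are not part of Splunk.

```python
import re

# The warning line quoted above, copied verbatim from the post.
LINE = (
    "02-16-2022 15:20:51.293 -0400 WARN TcpOutputProc - Tcpout Processor: "
    "The TCP output processor has paused the data flow. Forwarding to output "
    "group default-autolb-group has been blocked for 82200 seconds"
)

# Assumed pattern, written only to match the wording of the line above.
PATTERN = re.compile(
    r"Forwarding to output group (\S+) has been blocked for (\d+) seconds"
)


def parse_blocked(line):
    """Return (output_group, blocked_seconds), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def as_hours_minutes(seconds):
    """Split a duration in seconds into whole hours and whole minutes."""
    hours, remainder = divmod(seconds, 3600)
    return hours, remainder // 60


group, seconds = parse_blocked(LINE)
hours, minutes = as_hours_minutes(seconds)
print(f"{group}: blocked for {hours} h {minutes} min")
# prints "default-autolb-group: blocked for 22 h 50 min"
```

So when this line was written, the forwarder was reporting that its output had been blocked for almost 23 hours.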

Has anyone run into something similar, or can anyone help me identify what is happening?
Note that the log on the Heavy Forwarder doesn't show anything relevant.

Thanks in advance!

isoutamo · ‎02-17-2022

Hi

You obviously have blocked queues, at least on the HF side and maybe on the indexer side too. An easy way to see the situation on the HF side is to add it to the MC (Monitoring Console) as an indexer, e.g. in a custom group such as IHF. Then you can easily see what is happening in the queues and pipelines on that node (and on other nodes). If you don't have the MC in place yet, I strongly recommend setting it up.

Here are two excellent .conf presentations on how to examine the situation even without the MC.

https://conf.splunk.com/files/2019/slides/FN1570.pdf
https://conf.splunk.com/files/2019/slides/FN1402.pdf
https://github.com/silkyrich/cluster_health_tools (git repo for previous presentation)
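
As a rough illustration of what looking for blocked queues without the MC could involve, the sketch below counts how often each queue is reported as blocked in metrics.log-style lines. It assumes the usual key=value layout of queue entries (`group=queue`, `name=...`, `blocked=true`). The sample lines are made up for the example and are not from the poster's system, and `blocked_queue_counts` is a hypothetical helper, not a Splunk tool.

```python
import re
from collections import Counter

# Illustrative lines in the key=value style of metrics.log queue entries.
# They are invented for this sketch, not taken from a real system.
SAMPLE = """\
02-16-2022 15:20:40.100 -0400 INFO  Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511, current_size=1200, largest_size=1200, smallest_size=0
02-16-2022 15:20:40.100 -0400 INFO  Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=12, current_size=30, largest_size=45, smallest_size=0
02-16-2022 15:21:11.100 -0400 INFO  Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511, current_size=1200, largest_size=1200, smallest_size=1100
02-16-2022 15:21:11.100 -0400 INFO  Metrics - group=thruput, name=index_thruput, instantaneous_kbps=0.0
"""

# Matches key=value pairs; values end at a comma or whitespace.
KEY_VALUE = re.compile(r"(\w+)=([^,\s]+)")


def blocked_queue_counts(lines):
    """Count, per queue name, the lines that carry blocked=true."""
    counts = Counter()
    for line in lines:
        fields = dict(KEY_VALUE.findall(line))
        if fields.get("group") == "queue" and fields.get("blocked") == "true":
            counts[fields["name"]] += 1
    return counts


counts = blocked_queue_counts(SAMPLE.splitlines())
for name, n in counts.most_common():
    print(f"{name}: blocked in {n} samples")
```

On a real HF the same function could be fed the lines of `$SPLUNK_HOME/var/log/splunk/metrics.log`. A queue that is blocked in most samples around the time forwarding stops is a likely place to start looking.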

r. Ismo

somesoni2 · ‎02-17-2022

Use the DMC to see what's going on with the HF. The UF logs suggest that the HF (as defined in outputs.conf for the stanza default-autolb-group) is down or unavailable, causing data ingestion to stop. Use the "Indexing Performance" dashboards in the DMC to see if any queues are filling up.
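
For a sense of what "filling up" means numerically, here is a small Python sketch that computes the peak fill percentage per queue as `current_size_kb / max_size_kb`. It is only an approximation of the idea behind such dashboards, not their implementation. The sample lines are invented, and `peak_fill_percent` is a hypothetical helper.

```python
import re

# Invented lines in the key=value style of metrics.log queue entries.
SAMPLE = [
    "02-16-2022 15:20:40.100 -0400 INFO  Metrics - group=queue, name=parsingqueue, max_size_kb=512, current_size_kb=256",
    "02-16-2022 15:20:40.100 -0400 INFO  Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=25",
    "02-16-2022 15:21:11.100 -0400 INFO  Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=512",
]

KEY_VALUE = re.compile(r"(\w+)=([^,\s]+)")


def peak_fill_percent(lines):
    """Return {queue_name: highest observed fill percentage}."""
    peaks = {}
    for line in lines:
        fields = dict(KEY_VALUE.findall(line))
        if fields.get("group") != "queue":
            continue
        try:
            max_kb = float(fields["max_size_kb"])
            current_kb = float(fields["current_size_kb"])
        except (KeyError, ValueError):
            continue  # skip entries without usable size fields
        if max_kb <= 0:
            continue
        percent = round(100.0 * current_kb / max_kb, 1)
        name = fields["name"]
        peaks[name] = max(peaks.get(name, 0.0), percent)
    return peaks


for name, percent in sorted(peak_fill_percent(SAMPLE).items()):
    print(f"{name}: peak fill {percent}%")
```

A queue that sits near 100% for long stretches is being filled faster than the next stage drains it.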

SanjayReddy · ‎02-17-2022

Hi @JuanAntunes

There are a couple of possible reasons for this issue.

Please check whether any queues are filling up on the UF side because some sources are sending too much data at once.

Also check for network issues between the UF and the HF: look in splunkd.log for timeout messages, and check from the HF side as well.

Also check splunkd.log for any other ERROR or WARN messages.
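
One way to get a quick overview of the ERROR and WARN entries is to count them per component. The Python sketch below assumes the line layout seen in the warning quoted in the question (date, time, UTC offset, level, component, then " - "). Only the first sample line is real; the others are clearly marked placeholders, and `count_problems` is a hypothetical helper.

```python
import re
from collections import Counter

LOG_LINES = [
    # Copied verbatim from the original question.
    "02-16-2022 15:20:51.293 -0400 WARN TcpOutputProc - Tcpout Processor: "
    "The TCP output processor has paused the data flow. Forwarding to output "
    "group default-autolb-group has been blocked for 82200 seconds",
    # Placeholder lines, invented to exercise the parser.
    "02-16-2022 15:20:52.000 -0400 INFO  ExampleComponent - placeholder informational message",
    "02-16-2022 15:20:53.000 -0400 ERROR ExampleComponent - placeholder error message",
    "02-16-2022 15:20:54.000 -0400 WARN  TcpOutputProc - placeholder warning message",
]

# date, time, offset, then the level and the component name.
ENTRY = re.compile(r"^\S+ \S+ \S+ (WARN|ERROR)\s+(\S+) - ")


def count_problems(lines):
    """Count WARN and ERROR entries per (level, component)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


for (level, component), n in count_problems(LOG_LINES).most_common():
    print(f"{level:5} {component}: {n}")
```

Running this over splunkd.log from both a UF and the HF, limited to the minutes before forwarding stops, may show which component complains first.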

When we faced the same issue, it turned out to be caused by intermittent network issues.

In your case it might be the same issue or a different one.
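
A basic reachability check from a UF host toward the HF can help rule the network in or out. The Python sketch below only tests whether a TCP connection can be opened. `can_connect` is a hypothetical helper, and the demo runs against a throwaway local listener so that it works anywhere.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a temporary local listener instead of a real forwarder.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = can_connect("127.0.0.1", demo_port)
listener.close()
reachable_after_close = can_connect("127.0.0.1", demo_port)

print("listening:", reachable_while_listening)
print("closed:", reachable_after_close)
```

On a real UF host the call would look like `can_connect("hf.example.internal", 9997)`, where the hostname is a placeholder and 9997 is the port from the question. A successful connection only shows that the port is reachable. A receiver with blocked queues can still accept the connection, so this does not rule out the queue problem described in the other replies.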

gcusello · ‎02-17-2022

Hi @JuanAntunes,

Some additional questions:

Have you used the correct reference hardware for your HF?
Which other jobs are scheduled on your Heavy Forwarder?
Are you sure that, when forwarding stops, there isn't a job using up the available bandwidth?

It seems that forwarding sometimes stops when a scheduled job starts.

Ciao.

Giuseppe
