Hi all, hoping someone can help me.
We have a number of Windows servers with the Universal Forwarder installed (9.3.0) and they are configured to forward logs to an internal heavy forwarder server running Linux.
Recently we've seen crashes on the Windows servers, which appear to be caused by Splunk-MonitorNoHandle consuming more and more RAM until there is none left. I have limited the RAM that Splunk can use to stop the crashing, but I still need to understand the root cause.
It seems to me that the HF is blocking the connection for some reason, and when that happens the Windows server has to cache the entries in memory. Once the connection is blocked, it never seems to unblock, and the backlog just keeps growing.
Here is an example from the log:
08-21-2024 16:42:13.223 +0100 WARN  TcpOutputProc [6844 parsing] - The TCP output processor has paused the data flow. Forwarding to host_dest=splunkhf02.mydomain.net inside output group default-autolb-group from host_src=WINDOWS02 has been blocked for blocked_seconds=54300. This can stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
I tried setting maxKBps to 0 in limits.conf on the Windows server, and I also tried 256 and 512, but we're still having the same problem.
If I restart the Splunk service it 'solves' the issue, but of course that also loses all of the log entries buffered in RAM.
Can anyone help me to understand the process here?
Thanks for any assistance!
Hmm, after further investigation it appears that this might not have anything to do with the throughput settings on either server. Digging into the logs, the problem always begins when the Heavy Forwarder is patched. At that point the Windows server stops being able to send logs and never recovers, even once the HF is available again. I wonder if this is related to v9.3.0 of the agent, because we didn't see any issues before the agent was upgraded.
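In case it helps anyone else line these warnings up with a patch window, here is a rough Python sketch of how the start of a block can be estimated: take the timestamp of the warning and subtract blocked_seconds. The regex is only an assumption based on the single log line quoted above, the function name block_start is my own, and the timestamp is treated as the forwarder's local time with the UTC offset ignored.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the one TcpOutputProc warning quoted in this thread.
BLOCKED_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\.\d{3} [+-]\d{4} "
    r"WARN\s+TcpOutputProc\b.*?"
    r"host_dest=(?P<dest>\S+).*?blocked_seconds=(?P<secs>\d+)"
)


def block_start(line):
    """Return (destination host, estimated local time the block began), or None."""
    m = BLOCKED_RE.search(line)
    if m is None:
        return None
    # Timestamps in the quoted log are month-day-year.
    logged_at = datetime.strptime(m.group("ts"), "%m-%d-%Y %H:%M:%S")
    return m.group("dest"), logged_at - timedelta(seconds=int(m.group("secs")))


sample = (
    "08-21-2024 16:42:13.223 +0100 WARN  TcpOutputProc [6844 parsing] - "
    "The TCP output processor has paused the data flow. Forwarding to "
    "host_dest=splunkhf02.mydomain.net inside output group default-autolb-group "
    "from host_src=WINDOWS02 has been blocked for blocked_seconds=54300."
)

dest, started = block_start(sample)
print(dest, started)  # splunkhf02.mydomain.net 2024-08-21 01:37:13
```

For the warning above this puts the start of the block at about 01:37 on the same day, which is the time to compare against the HF's patch schedule.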
