Getting Data In

Heavy Forwarder fills queues (but not always) when forwarding to external Syslog

Path Finder

We're trying to do:
UF (Win Event Logs) --> HF (v7.2.5 on Linux) --> Indexers (Linux) -AND- external Syslog destination.

This works, but only sporadically at an acceptable rate. Most of the time when I start the HF, it routes the data properly to both the indexers and the syslog destination, but extremely slowly (roughly one event per second), and then the queues start to block (the indexer queue first). Data still reaches each destination, just very slowly.

Every now and then, I restart the HF and it immediately works, blazing fast, and keeps working - until I restart the HF again (with NO changes) and it bogs down again.

When it's working, events make it from the Windows UF to the indexers and syslog within one second. When it's not working, it falls increasingly behind, though it is STILL routing the data to each destination very slowly.

If I remove the syslog routing transform in props.conf so data goes ONLY to the indexers, it works just fine every time.

Here is the config:

props.conf:
[host::winhost01]
TRANSFORMS-ntdc = indexers, extSyslog

transforms.conf:
[indexers]
REGEX = (.)
DEST_KEY = _TCP_ROUTING
FORMAT = primary_indexers

[extSyslog]
REGEX = (.)
DEST_KEY = _SYSLOG_ROUTING
FORMAT = destSyslog

outputs.conf:
[syslog:destSyslog]
server = 10.11.190.163:514
type = udp
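
For completeness, the primary_indexers group referenced by FORMAT in transforms.conf would also need a matching [tcpout:...] stanza in outputs.conf. The poster's actual indexer addresses aren't shown, so this is a sketch with placeholder hostnames:

```ini
# Hypothetical tcpout group matching FORMAT = primary_indexers;
# the real indexer addresses are not shown in the post.
[tcpout:primary_indexers]
server = idx01.example.com:9997, idx02.example.com:9997
```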

I have a Splunk support case open and sent multiple diags but we're all stumped as to what's going on here. We've checked the bandwidth to the syslog destination, tried a couple different internal ones, changed settings here and there, and I've reinstalled Splunk on the HF. This has plagued me for 3-4 weeks now.

Any ideas would be appreciated.


Path Finder

The problem here ended up being that we had a useACK = true in an outputs.conf file without a stanza header above it. As a result, it applied to all outputs, including the [syslog] output. A syslog server will not send back an ACK, so Splunk would wait 2 seconds for each event and then send the event anyway (based on our observations). We added a [tcpout] header above the "useACK = true" setting so it would apply only to tcpout and not to syslog output, and that fixed it.
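
In outputs.conf terms, the fix looked roughly like this (the syslog stanza is from the original post; the commented-out line illustrates the broken state):

```ini
# BROKEN: a bare useACK with no stanza header above it applies
# globally, including to [syslog:...] outputs, which never ACK.
# useACK = true

# FIXED: scope useACK to tcpout only.
[tcpout]
useACK = true

[syslog:destSyslog]
server = 10.11.190.163:514
type = udp
```

You can verify which stanza a setting ends up under with `splunk btool outputs list --debug`, which shows the merged config and the file each line came from.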

Early in the troubleshooting, we did hit on this setting. We added a useACK = false to the syslog stanza, but that apparently doesn't override the global setting. I've since explicitly tried setting it to false under the syslog stanza again, and it doesn't seem to matter: if it's set to true globally, that takes effect.

I still can't explain why maybe 1 out of 10 times we restarted, it would work just fine even though this config error was still present.

Thanks to Jack Herod from Splunk support for finally finding this configuration error. If you're at .conf, I owe you a beer.



Path Finder

Can anyone confirm that you are successfully routing data from a HF to both an indexer and to a Syslog destination?


Path Finder

I'll buy beer at .conf19 for anyone who can find the resolution to this.


Super Champion

Hi @davidstuffle,

Some questions to help you narrow this down:

1- Have you tried correlating this with the network quality of your link, to check for possible network issues?

2- Have you tried correlating it with the ingested data load -- i.e., does your HF get slower when more data is coming in?

3- Have you tried adding an extra data ingestion pipeline? It usually helps improve pipeline health:
https://docs.splunk.com/Documentation/Forwarder/7.3.1/Forwarder/Configureaforwardertohandlemultiplep...
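
For reference, extra pipelines are enabled in server.conf on the forwarder; a sketch, where the pipeline count of 2 is just an example value:

```ini
# server.conf on the HF -- adds a second ingestion pipeline set.
# Each pipeline set gets its own queues, so one slow output
# doesn't necessarily stall all inbound data.
[general]
parallelIngestionPipelines = 2
```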

Cheers,
David


Path Finder

I have tested throughput to the syslog destination a few different ways, and it shows far more than enough capacity.

I've set the UF so it's only sending internal logs, and it still ends up with blocked queues on the HF.

We did have 2 pipelines at one point, but support suggested changing back to one.

The kicker here is that it works sometimes. There have been several instances where I restart the HF and data flies through like a charm. It will continue to work just fine until I restart the HF. Then it slows down again and multiple subsequent restarts won't get it working again.


Super Champion

Is this a virtual instance? Maybe you are sharing resources with another application?

---Asking creative questions to get that beer lol


Path Finder

They are virtual servers. But we see no indication that there is a resource issue.


Super Champion

Well... if it's intermittent, then:

A- It's a misconfigured parameter (which I assume it isn't, since support has already had a look). You can still verify that by completely resetting the HF configuration -- if that's possible -- and seeing whether the slowdowns persist with less config. Then add apps back one by one until the slowdown returns, to identify the app causing it.

B- Or it's something you don't control:
--> Network -- I don't mean throughput or connectivity issues; I mean other applications that might have priority on the network and can cause your traffic to be dropped by some QoS rule.
--> Shared resources -- that one is the most probable, from what you're describing. With a clean configuration on virtualized hosts, the only thing that can slow down your server is reduced resources caused by another app using them.

B is harder to identify, as you don't really know what's happening there unless you contact your network team or your VM team.
