Hello, I have two Windows Event Collectors. Three domain controllers send their events to one event collector (WEC01), and three send their events to the other (WEC02).
From 8:00 onwards (i.e. the start of the working day), the events from WEC02 get progressively more delayed, up to about 20,000 seconds behind, before eventually catching up by about 4 AM.
Both systems have the same configuration, which is managed by a deployment server.
I have looked at:
https://answers.splunk.com/answers/224727/why-is-my-universal-forwarder-showing-extreme-lag.html?utm...
And various other posts and have the following set:
limits.conf
[thruput]
maxKBps = 0
outputs.conf
There doesn't appear to be any blockage in the indexer queues, as other events are indexed fine with no latency. CPU, memory and network are all fine on the virtual machine. I can see no obvious reason for the delay.
Both Windows Event collectors are virtual machines. They may be on different physical hosts. There is a difference in latency in packets between the two hosts.
Here is a screenshot from the Resource Monitor, network activity.
Slow Windows Event Collector (High Latency)
Fast Windows Event Collector (low latency)
Were you able to find a fix for this? If yes, please share.
It's a three-year-old thread, so the people might not even be active on this community anymore.
Having said that, I scrolled through the whole thread and I don't think anyone mentioned checking the throughput limit. If it's too low, it might be causing this queue buildup, since the forwarder is not able to send events as fast as it reads them.
- evt_resolve_ad_obj is set to 0 in inputs.conf
- maxKBps is set to 0 in limits.conf
These settings fixed it for me
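For reference, the two settings above might look like this on the forwarder. This is only a minimal sketch: the `[WinEventLog://ForwardedEvents]` stanza name is an assumption (it is the channel a WEC normally stores collected events in), so adjust it to whatever input you actually monitor.

```ini
# inputs.conf (stanza name is an assumption; use your own input)
[WinEventLog://ForwardedEvents]
# Do not resolve SIDs/GUIDs against Active Directory for each event
evt_resolve_ad_obj = 0

# limits.conf
[thruput]
# 0 removes the forwarder's throughput cap
maxKBps = 0
```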
Exactly. I can confirm that in our environment the issue was mainly due to the thruput constraint, but we set 25 MB/s instead of unlimited. The only strange thing, which kept us from quickly identifying the root cause, was that the Splunk Universal Forwarder had no delay collecting the Windows events generated locally by the WEC server itself. The only delay observed was in forwarding the events stored by the Windows Event Collector (WEC) that came from other machines through Windows Event Forwarding (WEF).
limits.conf
[thruput]
maxKBps = 25600
To understand the thruput limit in your environment you can use this query (set the limit well above the maximum you observe):
index=_internal sourcetype=splunkd group=tcpin_connections (connectionType=cooked OR connectionType=cookedSSL) hostname=your_WEC_host
| timechart minspan=30s max(eval(tcp_KBps)) as "KB/s", max(tcp_eps) as "Events/s"
It's not a very good practice to set maxKBps to no limit at all. In case of a sudden unexpected peak you might clog the pipeline on your indexers. So it might be reasonable to set this to a relatively high, but still fixed, value.
The problem may be the higher volume of Windows events to read during business hours.
The modular inputs doing the collection may be hammering the Windows API or waiting for it to respond.
See https://docs.splunk.com/Documentation/Splunk/latest/Data/MonitorWindowseventlogData
Try to reduce the collection by adding a whitelist and blacklist in the forwarder's inputs.conf.
Some verbose event codes may not be useful to you, and dropping them would reduce the volume.
Check whether your input is waiting for the AD server to resolve object names.
Check whether you can disable evt_resolve_ad_obj, or ensure that you are querying the closest/fastest AD server with evt_dc_name.
By default, the forwarder may be querying a remote, busy AD server.
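A sketch of what those suggestions could look like in inputs.conf on the forwarder. The stanza name, the event codes, and the domain controller name are all placeholders chosen for illustration, not recommendations; pick the codes based on what is actually noisy and unneeded in your environment.

```ini
# inputs.conf on the WEC's universal forwarder
# (stanza name assumed; event codes and DC name are hypothetical)
[WinEventLog://ForwardedEvents]
# Drop event codes you do not need before they are forwarded
blacklist = 4662,5156
# Keep AD object resolution, but point it at a nearby domain controller
evt_resolve_ad_obj = 1
evt_dc_name = dc01.example.local
```

If name resolution is not needed at all, setting evt_resolve_ad_obj = 0 avoids the AD lookups entirely.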
I checked that the events are visible in a timely manner under Forwarded Events, and I can see that they are arriving at the Windows Event Collector ready to be forwarded. So I know that the latency is introduced after they have reached the Windows Event Collector.
Hey @davidwaugh, are you running a distributed setup? If so, what does your index cluster look like?
Yep it's distributed. We have 4 indexers and the Universal Forwarders forward to all Indexers on a round robin time basis.
Is your Splunk Environment on-prem, hybrid, or in the cloud ?
I'm not sure the difference in network lag is that great. Pinging the indexers from each Windows Event Collector, I get ping times of 1 ms or less on both.
How many systems are forwarding the events to your Windows Event Collector ?
How many subscriptions do you have set up ?
Thank you. I think you are pointing me in the right direction. On the event collector that has a high latency, there is an additional subscription that I had forgotten about.
I have disabled this subscription to see if it makes a difference.
Hi, yes, it was this single subscription that was the cause of the issue. Do you want to enter it as an answer below so that I can mark it as correct?
Thanks very much for your help.
David
What was special about this subscription ?
How was it configured ?
Your answers will greatly help the community.
Sorry, you're right.
What is special about this subscription is that it collects from a single computer.
This single computer is itself an event collector for messages from a certain application.
About 8000 computers report into that event collector, but I can see the messages arriving constantly in my own event collector.
The messages, though, do not arrive in chronological order. A machine may be offline or not sending messages for some time, and its messages are then sent all at once, so I might suddenly see events from a few days ago at the top of my event collector log because they have just come in.
I've got a ticket open with Splunk support, who are helping me investigate. Normally with a log (or even Windows events) you would expect entries to arrive in chronological order. I'm wondering if events arriving with timestamps from any time over the last few days or weeks is causing the issue?
Is there any latency in the events that you have coming into Splunk from your WEC(s) ?
In our experience, 8,000 machines is too much for Splunk from a single WEC.
Are you using the UF or HF on your WEC ?
Or do you have your logs going from the WEC (UF) -> HF -> Indexer?
I'm curious about your architecture, because we have something similar.
Currently I have 6 WECs: 2 for Live, 2 for DMZ and 1 each for UAT and DEV.
The live ones are the busiest by far. Most things go to WEC01 in Live, and now only the special subscription mentioned above goes to WEC02. In terms of volume, I would say there is more on WEC01, as it takes all the events from the domain controllers.
There are universal forwarders on all WECs, which send to a distributed environment of indexers. There are no heavy forwarders between the UFs and the indexers.
Windows events are going into their own index. When searching this index, only events from WEC02 are showing as delayed.
How bad is the delay on Windows Events coming from WEC02 ?
It gets up to about 20,000 seconds, i.e. about five and a half hours.