Hello, I have two Windows Event Collectors. Three domain controllers send their events to one event collector (WEC01), and three send their events to the other (WEC02).
From 8:00 onwards (i.e. the start of the working day), the events from WEC02 get progressively more delayed, up to about 20,000 seconds (roughly 5.5 hours) behind, before eventually catching up by about 4 AM.
Both systems have the same configurations on them, which are managed by a deployment server.
I have read various other posts and have the following set:
maxKBps = 0
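For context, a sketch of where this setting normally lives, assuming it was applied in limits.conf on the universal forwarder (the post does not say which file was edited):

```
# limits.conf on the forwarder (assumed location)
[thruput]
# 0 removes the forwarder's throughput cap (the universal forwarder default is 256 KBps)
maxKBps = 0
```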
There doesn't appear to be any blockage in the indexer queues, as other events are indexed fine with no latency. CPU, memory and network are all fine on the virtual machine. I can see no obvious reason for the delay.
Both Windows Event collectors are virtual machines. They may be on different physical hosts. There is a difference in latency in packets between the two hosts.
Here are screenshots from Resource Monitor (network activity) on each collector:
[Screenshot: slow Windows Event Collector (high latency)]
[Screenshot: fast Windows Event Collector (low latency)]
The problem may be the higher volume of Windows events to read during business hours.
The modular inputs doing the collection may be hammering the Windows API or waiting for it to respond.
Try to reduce the collection by adding a whitelist and blacklist in the forwarder's inputs.conf.
Some verbose event codes may not be useful to you, and dropping them may reduce the volume.
Also check whether your input is waiting for the AD server to resolve object names.
Check whether you need to disable evt_resolve_ad_obj, or ensure that you are querying the closest/fastest domain controller via evt_dc_name.
By default, the forwarder may be querying a remote, busy AD server.
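A minimal sketch of what that could look like in inputs.conf on the forwarder. The event codes and the domain controller name below are placeholders for illustration, not recommendations, so adjust them to your own environment:

```
[WinEventLog://ForwardedEvents]
disabled = 0
# Placeholder event codes: drop noisy events you do not need
blacklist = 4662,5156
# Skip AD object resolution so the input does not wait on a domain controller
evt_resolve_ad_obj = 0
# Alternatively, keep resolution on and point at a nearby DC (hypothetical name)
# evt_dc_name = dc01.example.local
```

Disabling resolution is the quicker test: if the lag disappears, the input was probably waiting on AD lookups rather than on event volume.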
I checked that the events are visible in a timely manner under Forwarded Events, and I can see that they are arriving at the Windows Event Collector ready to be forwarded. So I know that the latency occurs after they have reached the Windows Event Collector.
I'm not sure the difference in network lag is that great. Pinging the indexers from each Windows Event Collector, I get ping times of 1 ms or less on both.
Thank you. I think you are pointing me in the right direction. On the event collector that has a high latency, there is an additional subscription that I had forgotten about.
I have disabled this subscription to see if it makes a difference.
Hi, yes, it was this single subscription that was the cause of the issue. Do you want to enter it as an answer below so that I can mark it as correct?
Thanks very much for your help.
Sorry, you're right.
What is special about this subscription is that it collects from a single computer.
This single computer that it collects from is itself an event collector for messages from a certain application.
About 8,000 computers report into this event collector, and I can see the messages arriving constantly in my own event collector.
The messages do not arrive in chronological order, though. A machine may be offline or not sending messages for some time, and its messages will then be sent all at once, so I might suddenly see events from a few days ago at the top of my event collector log because they have just come in.
I've got a ticket open with Splunk support, who are helping me investigate. Normally with a log (or even Windows events) you would expect entries to come in chronological order. I'm wondering whether events arriving with timestamps from anywhere in the last few days or weeks are causing the issue?
Is there any latency in the events coming into Splunk from your WEC(s)?
In our experience, 8,000 machines is too much for Splunk from a single WEC.
Are you using the UF or HF on your WEC?
Or do you have your logs going from WEC (UF) -> HF -> Indexer?
I'm curious about your architecture, because we have something similar.
Currently I have 6 WECs: 2 for Live, 2 for DMZ and 1 each for UAT and DEV.
The live ones are the busiest by far and I just have most stuff going to WEC01 in Live, with now only the special subscription mentioned above going to WEC02. In terms of volume I would say that there is more volume on WEC01 as this is taking all the events from the domain controllers.
There are universal forwarders on all WECs, which send to a distributed environment of indexers. There are no heavy forwarders between the UFs and the indexers.
Windows events are going into their own index. When searching this index, only events from WEC02 are showing as delayed.
Yep, no congestion on the indexers. For instance, at the same time I am ingesting syslog events, and the delay for these is only a few seconds.
As far as I'm aware, if this were an indexer problem, then all indexes would show as being behind, not just one index, and not from only one forwarder.
How is your ForwardedEvents stanza configured?
sourcetype = WinEventLog:ForwardedEvents
disabled = 0
#start_from = oldest
current_only = 1
evt_resolve_ad_obj = 1
checkpointInterval = 5
Have you tried changing start_from to newest, restarting, then switching it back to oldest?