Extreme Latency with Windows Events on one Windows Event Collector. How do I troubleshoot?

davidwaugh
Path Finder

Hello, I have two Windows Event Collectors. Three domain controllers send their events to one event collector (WEC01), and three send their events to the other event collector (WEC02).

From 08:00 onwards (i.e. the start of the working day), the events from WEC02 become progressively delayed, up to about 20,000 seconds behind, before eventually catching up by about 4 AM.

Both systems have the same configurations on them, which are managed by a deployment server.

I have looked at:
https://answers.splunk.com/answers/224727/why-is-my-universal-forwarder-showing-extreme-lag.html?utm...

And various other posts and have the following set:

limits.conf

[thruput]
maxKBps = 0

outputs.conf

[screenshot of outputs.conf, image not preserved]

There doesn't appear to be any blockage in the indexer queues, as other events are indexed fine with no latency. CPU, memory and network are all fine on the virtual machine. I can see no obvious reason for the delay.

Both Windows Event collectors are virtual machines. They may be on different physical hosts. There is a difference in latency in packets between the two hosts.

Here is a screenshot from Resource Monitor, network activity.

Slow Windows Event Collector (high latency)

[screenshot, image not preserved]

Fast Windows Event Collector (low latency)

yannK
Splunk Employee

The problem may be the higher volume of Windows events to read during business hours.
The modular inputs doing the collection may be hammering the Windows API or waiting for it to respond.

See https://docs.splunk.com/Documentation/Splunk/latest/Data/MonitorWindowseventlogData

  • Try to reduce the collection by adding a whitelist and blacklist to the forwarder's inputs.conf.
    Some verbose event codes may not be useful to you, and dropping them may reduce the volume.

  • Check whether your input is waiting for the AD server to resolve object names.
    See whether you can disable evt_resolve_ad_obj, or make sure evt_dc_name points at the closest/fastest AD server.
    By default, the forwarder may be querying a remote, busy AD server.
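
To make the two suggestions above concrete, here is a minimal inputs.conf sketch for the forwarder on the collector. The event codes in the blacklist and the domain controller name are placeholders chosen for illustration, not values from this thread; check which event codes dominate your own volume before filtering anything.

```
[WinEventLog://ForwardedEvents]
disabled = 0
# Drop noisy event codes at the forwarder (example IDs only, pick your own)
blacklist = 4662,5156
# Skip Active Directory object resolution so the input does not wait on a DC
evt_resolve_ad_obj = 0
# If you do need resolution, keep evt_resolve_ad_obj = 1 and point the input
# at a nearby domain controller instead ("dc-local.example.com" is hypothetical)
# evt_dc_name = dc-local.example.com
```

Restart the forwarder after changing the stanza, then compare the lag before and after.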

0 Karma

davidwaugh
Path Finder

I checked that the events show up promptly under Forwarded Events, so I can see that they are arriving at the Windows Event Collector ready to be forwarded. So I know the latency is introduced after they have reached the Windows Event Collector.

0 Karma

dillardo_2
Path Finder

Hey @davidwaugh, are you running a distributed setup? If so, what does your index cluster look like?

0 Karma

davidwaugh
Path Finder

Yep, it's distributed. We have 4 indexers, and the Universal Forwarders forward to all indexers on a time-based round-robin basis.
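
For readers unfamiliar with this kind of setup, time-based load balancing across indexers in outputs.conf looks roughly like the sketch below. The group name, host names and port are hypothetical and are not the actual values from this environment (the real outputs.conf was posted as a screenshot).

```
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# Hypothetical indexer addresses; the forwarder rotates between them
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997, idx4.example.com:9997
# Seconds before the forwarder switches to another indexer in the list
autoLBFrequency = 30
```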

0 Karma

itrimble1
Path Finder

Is your Splunk environment on-prem, hybrid, or in the cloud?

0 Karma

davidwaugh
Path Finder

I'm not sure the difference in network lag is that great. Pinging the indexers from each Windows Event Collector, I get ping times of 1 ms or less from both.

0 Karma

itrimble1
Path Finder

How many systems are forwarding events to your Windows Event Collector?
How many subscriptions do you have set up?

davidwaugh
Path Finder

Thank you. I think you are pointing me in the right direction. On the event collector that has a high latency, there is an additional subscription that I had forgotten about.

I have disabled this subscription to see if it makes a difference.

0 Karma

davidwaugh
Path Finder

Hi, yes, it was this single subscription that was causing the issue. Do you want to enter it as an answer below so that I can mark it as correct?

Thanks very much for your help.
David

0 Karma

itrimble1
Path Finder

What was special about this subscription?

How was it configured ?
Your answers will greatly help the community.

0 Karma

davidwaugh
Path Finder

Sorry, you're right.

What is special about this subscription is that it collects from a single computer.
That computer is itself an event collector for messages from a certain application.

About 8,000 computers report into that event collector, but I can see the messages arriving constantly in my own event collector.

The messages do not arrive in chronological order, though. A machine may be offline or not send messages for some time, and its messages are then sent all at once, so I might suddenly see events from a few days ago at the top of my event collector log because they have just come in.

I've got a ticket open with Splunk support, who are helping me investigate. Normally with a log (or even Windows events) you would expect events to arrive in chronological order. I'm wondering whether events arriving with timestamps from anywhere in the last few days or weeks are causing the issue?

0 Karma

itrimble1
Path Finder

Is there any latency in the events coming into Splunk from your WEC(s)?
In our experience, 8,000 machines is too many for Splunk from a single WEC.

Are you using a UF or an HF on your WEC?

Or do your logs go WEC (UF) -> HF -> Indexer?

I'm curious about your architecture, because we have something similar.

davidwaugh
Path Finder

Currently I have 6 WECs: 2 for Live, 2 for DMZ, and 1 each for UAT and DEV.

The Live ones are the busiest by far. Most things go to WEC01 in Live, and now only the special subscription mentioned above goes to WEC02. In terms of volume, I would say there is more on WEC01, as it takes all the events from the domain controllers.

There are Universal Forwarders on all WECs, which send to a distributed environment of indexers. There are no heavy forwarders between the UFs and the indexers.

Windows events are going into their own index. When searching this index, only events from WEC02 are showing as delayed.

itrimble1
Path Finder

How bad is the delay on Windows events coming from WEC02?

0 Karma

davidwaugh
Path Finder

It gets up to about 20,000 seconds, i.e. roughly five and a half hours.

0 Karma

itrimble1
Path Finder

Is there any difference in WEC02's configuration, on either the collector or the UF side?
Is the volume of events the same for WEC02?

0 Karma

davidwaugh
Path Finder

Nope, if anything there are fewer events on WEC02 than there are on WEC01.

0 Karma

itrimble1
Path Finder

Have you checked your indexers for congestion? Have you checked the parsingQueue or the indexQueue?

davidwaugh
Path Finder

Yep, no congestion on the indexers. For instance, I am ingesting syslog events at the same time, and the delay for those is only a few seconds.

As far as I'm aware, if this were an indexer problem, all indexes would show as being behind, not just one index and not from only one forwarder.

0 Karma

itrimble1
Path Finder

How is your ForwardedEvents stanza configured?

[WinEventLog://ForwardedEvents]
sourcetype = WinEventLog:ForwardedEvents
disabled = 0
#start_from = oldest
current_only = 1

evt_resolve_ad_obj = 1
checkpointInterval = 5

Have you tried changing start_from to newest, restarting, and then switching it back to oldest?
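
A sketch of what the temporary variant of the stanza could look like. As far as I recall from inputs.conf.spec, start_from = newest should not be combined with current_only = 1 (the input may then collect nothing), and combining it with current_only = 0 can produce duplicate events, so treat the values below as assumptions to verify against the spec for your version before trying this.

```
[WinEventLog://ForwardedEvents]
sourcetype = WinEventLog:ForwardedEvents
disabled = 0
# Temporary setting: read the channel from newest to oldest
start_from = newest
# Assumption: current_only = 1 is not valid together with start_from = newest
current_only = 0
checkpointInterval = 5
```

After the restart, revert start_from to oldest (or comment it out again) and restore your usual current_only value.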

0 Karma