Getting Data In

Extreme Latency with Windows Events on one Windows Event Collector. How do I troubleshoot?

davidwaugh
Path Finder

Hello, I have two Windows Event Collectors. Three domain controllers send their events to one event collector (WEC01), and three send theirs to another event collector (WEC02).

From 8:00 onwards (i.e. the start of the working day), the events from WEC02 get progressively delayed, falling up to about 20,000 seconds behind, before eventually catching up by about 4 AM.

Both systems have the same configuration, which is managed by a deployment server.

I have looked at:
https://answers.splunk.com/answers/224727/why-is-my-universal-forwarder-showing-extreme-lag.html?utm...

and various other posts, and have the following set:

limits.conf

[thruput]
maxKBps = 0

outputs.conf

[screenshot of outputs.conf]

There doesn't appear to be any blockage in the indexer queues, as other events are indexed fine with no latency. CPU, memory, and network are all fine on the virtual machine. I can see no obvious reason for the delay.
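
For reference, one way to double-check indexer queue health is to chart queue fill from the indexers' metrics.log; this is only a sketch, and the host filter is a placeholder:

index=_internal source=*metrics.log* group=queue host=your_indexer*
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=5m max(fill_pct) by name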

Both Windows Event Collectors are virtual machines, and they may be on different physical hosts. There is a difference in packet latency between the two hosts.

Here is a screenshot from Resource Monitor showing network activity.

Slow Windows Event Collector (high latency)

[screenshot: Resource Monitor network activity]

Fast Windows Event Collector (low latency)

dm1
Contributor

Were you able to find a fix for this? If yes, please share.

PickleRick
SplunkTrust

It's a three-year-old thread, so the people involved might not even be active on this community anymore.

Having said that, I scrolled through the whole thread and I don't think anyone mentioned checking the throughput limit. If it's set too low, it could be causing this queue buildup, since the forwarder would not be able to send events as fast as it reads them.
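
A quick way to confirm whether a forwarder is actually hitting its thruput ceiling is to look for ThruputProcessor messages in its own splunkd.log; this search is a sketch, the exact message wording can vary between versions, and the host name is a placeholder:

index=_internal host=your_WEC_host source=*splunkd.log* component=ThruputProcessor "has reached maxKBps"
| timechart span=10m count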

dm1
Contributor

- evt_resolve_ad_obj is set to 0 in inputs.conf
- maxKBps is set to 0 in limits.conf

These settings fixed it for me.
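
For anyone landing here later, a minimal sketch of where those two settings live on the WEC's Universal Forwarder (the stanza name below is the usual one for WEF-forwarded events; adjust it to match your own inputs):

inputs.conf

[WinEventLog://ForwardedEvents]
disabled = 0
evt_resolve_ad_obj = 0

limits.conf

[thruput]
maxKBps = 0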

edoardo_vicendo
Contributor

Exactly. I can confirm that in our environment the issue was mainly due to the thruput constraint, although we set 25 MB/s rather than unlimited. The one strange thing that kept us from quickly identifying the root cause was that the Splunk Universal Forwarder had no delay collecting the Windows events generated locally by the WEC server itself. The delay was only observed when the Universal Forwarder forwarded the events stored on the Windows Event Collector (WEC) that arrived from the other machines via Windows Event Forwarding (WEF).

limits.conf

[thruput]
maxKBps = 25600

To work out the right thruput limit for your environment, you can use this query (stay comfortably above the maximum you observe):

index=_internal sourcetype=splunkd group=tcpin_connections (connectionType=cooked OR connectionType=cookedSSL) hostname=your_WEC_host
| timechart minspan=30s max(eval(tcp_KBps)) as "KB/s", max(tcp_eps) as "Events/s"

PickleRick
SplunkTrust

It's not very good practice to set maxKBps to no limit at all. In the case of a sudden unexpected peak you might clog your pipeline on the indexers, so it might be more reasonable to set this to a relatively high, but still fixed, value.

yannK
Splunk Employee

The problem may be the higher volume of Windows events to read during business hours.
The modular input doing the collection may be hammering the Windows API or waiting for it to respond.

See https://docs.splunk.com/Documentation/Splunk/latest/Data/MonitorWindowseventlogData

  • Try to reduce the collection by adding a whitelist and blacklist in the forwarder's inputs.conf.
    Some verbose event codes may not be useful for you to collect, and dropping them would reduce the volume (see the inputs.conf sketch below this list).

  • Check whether your input is waiting for the AD server to resolve object names.
    Check if you need to disable evt_resolve_ad_obj, or ensure that you are querying the closest/fastest AD via evt_dc_name;
    by default, the forwarder may be querying a remote, busy AD.
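
As a rough illustration of both points, the relevant inputs.conf settings look something like this; the blacklisted event codes and the domain controller name are placeholders, not recommendations:

[WinEventLog://Security]
disabled = 0
# drop event codes you have confirmed you do not need (examples only)
blacklist = 4662,5156
# either skip AD object resolution entirely...
evt_resolve_ad_obj = 0
# ...or keep it enabled and point it at a close, responsive domain controller
# evt_dc_name = dc01.example.local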

davidwaugh
Path Finder

I checked that the events are visible in a timely manner under Forwarded Events, and I can see that they are arriving at the Windows Event Collector ready to be forwarded. So I know that the latency occurs after they have reached the Windows Event Collector.

dillardo_2
Path Finder

Hey @davidwaugh, are you running a distributed setup? If so, what does your index cluster look like?

davidwaugh
Path Finder

Yep, it's distributed. We have 4 indexers, and the Universal Forwarders forward to all indexers on a round-robin time basis.
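
For context, a load-balanced outputs.conf for that kind of round-robin forwarding usually looks something like the sketch below (indexer names and the rotation interval are placeholders):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.local:9997, idx2.example.local:9997, idx3.example.local:9997, idx4.example.local:9997
# switch target indexer roughly every 30 seconds (time-based load balancing)
autoLBFrequency = 30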

itrimble1
Path Finder

Is your Splunk environment on-prem, hybrid, or in the cloud?

davidwaugh
Path Finder

I'm not sure the difference in network lag is that great. Doing a ping on each site between the Windows Event Collectors and the indexers, I get ping times of 1 ms or less for both Windows Event Collectors.

itrimble1
Path Finder

How many systems are forwarding events to your Windows Event Collector?
How many subscriptions do you have set up?

davidwaugh
Path Finder

Thank you. I think you are pointing me in the right direction. On the event collector with the high latency, there is an additional subscription that I had forgotten about.

I have disabled this subscription to see if it makes a difference.

davidwaugh
Path Finder

Hi, yes, it was this single subscription that was the cause of the issue. Do you want to enter it as an answer below so that I can mark it as correct?

Thanks very much for your help.
David

itrimble1
Path Finder

What was special about this subscription?

How was it configured?
Your answers will greatly help the community.

davidwaugh
Path Finder

Sorry, you're right.

What is special about this subscription is that it collects from a single computer.
This single computer that it collects from is itself an event collector for messages from a certain application.

About 8,000 computers report into that event collector, and I can see the messages arriving constantly in my own event collector.

The messages do not arrive in a nice timely order, though. A machine may be offline or not sending messages for some time, and its messages will then be sent all at once, so I might suddenly see events from a few days ago at the top of my event collector log as they have just come in.

I've got a ticket open with Splunk support, who are helping me investigate. Normally with a log (or even Windows events) you would expect them to come in in a timely order. I'm wondering if the events coming in from any time over the last few days or weeks are causing the issue.

itrimble1
Path Finder

Is there any latency in the events that you have coming into Splunk from your WEC(s)?
In our experience, 8,000 machines is too many for Splunk to handle from a single WEC.

Are you using a UF or an HF on your WEC?

Or do you have your logs going from the WEC (UF) -> HF -> indexer?

I'm curious about your architecture, because we have something similar.

davidwaugh
Path Finder

Currently I have 6 WECs: 2 for Live, 2 for DMZ, and 1 each for UAT and DEV.

The Live ones are by far the busiest, and most collection goes to WEC01 in Live, with now only the special subscription mentioned above going to WEC02. In terms of volume, I would say there is more on WEC01, as it takes all the events from the domain controllers.

There are Universal Forwarders on all WECs, which send to a distributed environment of indexers. There are no heavy forwarders between the UFs and the indexers.

Windows events are going into their own index. When searching this index, only events from WEC02 are showing as delayed.

itrimble1
Path Finder

How bad is the delay on Windows events coming from WEC02?

davidwaugh
Path Finder

It gets up to about 20,000 seconds, i.e. over 8 hours.
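
For reference, one way to chart that lag per collector is to compare index time with event time; the index name below is a placeholder, and you may need to split by a different field depending on how host is set for your WEF events:

index=your_windows_index
| eval lag_seconds = _indextime - _time
| timechart span=15m max(lag_seconds) by host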
