Solved: Splunk UF Windows Security Event logs just seem to randomly stop sending

PeterBoard · ‎09-04-2024

We are experiencing an issue on a few random servers, some Domain Controllers and some member servers: Windows Security Event logs just seem to randomly stop sending. If I restart the Splunk UF, the event logs start being collected again. We are using Splunk UF 9.1.5, but I also noticed this issue on Splunk UF 9.1.4. I thought it had been corrected when we upgraded to 9.1.5, but it has reappeared. The most recent occurrence was roughly 3 weeks ago, on 15 servers across multiple clients we manage. Unfortunately this has resulted in the loss of a few weeks of data, as the local event logs filled up and the older events were eventually discarded.

I have now written a Splunk Alert to notify us each day if any servers are in this situation (compares the Windows servers reporting into two different indexes, one index is for Windows Security Event logs), so we can more easily spot the issue. We are just raising a case with Splunk support today about the issue.

PeterBoard · ‎10-02-2024

Working with Splunk support on this, it came down to two issues with the way the Splunk UF is currently designed.

Firstly, when the Splunk service is starting, if it can't get a response from the Event Log within 30 seconds, it stops trying to collect Windows Events until the service is restarted. I have found that this can happen when a server is rebooted while applying patches: on the final reboot, the Event Log can be delayed in responding. The Splunk service starts, but if that request times out you get no collection of the Windows Event logs at all, as there is currently no automatic retry after the 30 seconds. The workaround is to change the Splunk UF service to Automatic (Delayed Start), to try to overcome this issue.

The second issue is to do with the Windows Event Log capture directive evt_resolve_ad_obj=1. If the Splunk UF needs to resolve an AD SID that is not already cached, and the resolution times out (say a Domain Controller was rebooting at the time), then the Splunk UF stops capturing any more events for that Event Log until the UF is restarted. Once again, it doesn't automatically retry or continue on to the next event entry. The workaround is to set evt_resolve_ad_obj=0, so it doesn't try to resolve any unknown SIDs. You won't know this has occurred unless you are monitoring the data in your indexes for each host, checking whether the Event Log data is arriving or not.
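For anyone applying this second workaround, a minimal inputs.conf sketch might look like the following. The stanza name matches the Security input discussed in this thread; which app's local directory the file lives in depends on your deployment and is an assumption here.

```ini
# inputs.conf on the forwarder (app location is deployment-specific)
[WinEventLog://Security]
# Do not try to resolve AD objects such as SIDs at collection time,
# so a timed-out lookup cannot stall this input
evt_resolve_ad_obj = 0
```

The trade-off is that events arrive with raw SIDs rather than resolved names. The UF needs a restart for the change to take effect.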

Splunk informed us that the behaviors we are experiencing are due to the current design of the product, so fixing them would come under enhancement requests. The case technician has submitted two feature requests on our behalf:

1. EID-I-2424: Implement a retry mechanism or allow configurable timeout settings to address the 30-second initialization timeout for Windows event log data collection in Splunk Universal Forwarder.

https://ideas.splunk.com/ideas/EID-I-2424

2. EID-I-2425: Enhance the `evt_resolve_ad_obj=1` setting to skip or retry unresolved Security Identifiers (SIDs) instead of halting event log collection when SID resolution fails.

https://ideas.splunk.com/ideas/EID-I-2425

dural_yyz · ‎09-09-2024

If you have a large organization with a large number of identities in your AD, you will want to consider reviewing the default cache size. Increasing the cache size helps avoid the additional CPU cycles spent replacing the Windows unique ID with a human-readable format.

evt_ad_cache_disabled = <boolean>
* Enables or disables the AD object cache.
* Default: false (enabled)

evt_ad_cache_exp = <integer>
* The expiration time, in seconds, for AD object cache entries.
* This setting is optional.
* Default: 3600 (1 hour)

evt_ad_cache_exp_neg = <integer>
* The expiration time, in seconds, for negative AD object cache entries.
* This setting is optional.
* Default: 10

evt_ad_cache_max_entries = <integer>
* The maximum number of AD object cache entries.
* This setting is optional.
* Default: 1000
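As a rough sketch, raising the cache size would look something like this in inputs.conf. The value 5000 is purely illustrative, not a recommendation, and as far as I understand these cache settings only matter while evt_resolve_ad_obj is enabled for the input.

```ini
# inputs.conf on the forwarder (illustrative value, tune for your AD size)
[WinEventLog://Security]
evt_resolve_ad_obj = 1
# Keep more resolved AD objects in memory than the default of 1000
evt_ad_cache_max_entries = 5000
```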

PeterBoard · ‎09-11-2024

Thanks for the suggestion. I don't think the Domain Controllers in question, from two different client setups out of the number of clients that we look after, would be considered a large environment. It's not been every DC, just a few random ones. One setup has about 300 workstations / accounts talking to a DC; another is a management zone, with 100 or so accounts and a small number of servers providing services.

Zacknoid · ‎09-09-2024

Hi Peter,

Could you please check the event queues:

Event queue backlog: check whether event queues on the forwarders are building up (seen in metrics.log). This can happen if there's too much data being processed at once.
Network: check for any changes in network utilization during the periods when the logs stop (on both the receiver side and the forwarder side).

Also check the following settings on the forwarder side, in outputs.conf (this worked in my case, but results may vary in your setup)

useACK=false

autoBatch=false
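For reference, both of these are outputs.conf settings rather than inputs.conf ones. A minimal sketch, assuming they are set at the global [tcpout] level (your deployment may define them elsewhere):

```ini
# outputs.conf on the forwarder (stanza placement is an assumption)
[tcpout]
# Treat data as processed once it is written to the network socket
useACK = false
# Send one chunk/event at a time instead of batching
autoBatch = false
```

Running "splunk btool outputs list --debug" on the forwarder shows which file each effective value actually comes from.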

PeterBoard · ‎09-11-2024

Hi Zack,

So I checked with our team that manages our indexers / heavy forwarders / Splunk backend. I also checked metrics.log on a server we are using in our Splunk support case, and couldn't see any queues building up. The sample server (a SQL member server) doesn't really have a high level of traffic either. During the period that the Security logs aren't sending, I can still see data coming in from the Windows Application Event Log and other Windows Event logs (like the AppLocker and SMB auditing event logs), so Event Log data is arriving, just not from the Security log during the error periods. A restart of the UF causes it to re-process anything that is still local in the Security Event Log.

We had an older case somewhat like this, where the Windows Defender Antivirus event logs were not capturing data. The outcome was that Splunk added a new directive, channel_wait_time=, which makes the Splunk UF re-test whether the Event Log exists after it has been inaccessible for a period of time, so that data capture resumes. It could be that a similar directive needs to be added here, but it's not been required during the many years we have had Splunk running.

Recently, on advice from Splunk in an ongoing case about another issue, they changed this on the indexers, so that bit is set as you mentioned:

useACK = false

* A value of "false" means the forwarder considers the data fully processed when it finishes writing it to the network socket.

For autoBatch, our setup currently has the following (they mentioned that a value of false is legacy behaviour):

autoBatch = true

* When set to 'true', the forwarder automatically sends chunks/events in batches to target receiving instance connection. The forwarder creates batches only if there are two or more chunks/events available in output connection queue.

* When set to 'false', the forwarder sends one chunk/event to target receiving instance connection. This is old legacy behavior.

* Default: true

SanjayReddy · ‎09-05-2024

Hi @PeterBoard

In fact, we recently faced the same issue on a domain controller, where the UF stopped sending data.
We found that the queues had filled up.

Support asked us to change useACK to false to avoid the issue, and said it is not recommended to use useACK=true on a UF.

In your case, did you observe any errors in splunkd.log during the issue?

PeterBoard · ‎09-08-2024

Looking at our inputs.conf setup via "splunk btool inputs list --debug", I can't see that we have useACK=true set (if that's where you are referring to). We are capturing a range of Event IDs for reporting, as below.

It all works fine again after restarting the Splunk UF when the issue occurs, however.

\etc\apps\inputs_oswin_secevtlog\local\inputs.conf [WinEventLog://Security]
\etc\apps\Splunk_TA_windows\default\inputs.conf blacklist1 = EventCode="4662" Message="Object Type:(?!\s*groupPolicyContainer)"
\etc\apps\Splunk_TA_windows\default\inputs.conf blacklist2 = EventCode="566" Message="Object Type:(?!\s*groupPolicyContainer)"
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf checkpointInterval = 5
\etc\apps\Splunk_TA_windows\default\inputs.conf current_only = 0
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf disabled = 0
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf evt_resolve_ad_obj = 1
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf index = win-securityeventlog
\etc\system\default\inputs.conf interval = 60
\etc\apps\Splunk_TA_windows\default\inputs.conf renderXml = true
\etc\apps\Splunk_TA_windows\default\inputs.conf start_from = oldest
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist1 = EventCode=%^(104|1102)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist2 = EventCode=%^(2004|2006|2033)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist3 = EventCode=%^(33205)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist4 = EventCode=%^(4170|4624|4625|4634|4647|4648|4663|4673|4688|4719|4720|4722|4723|4724|4725|4726|4728|4732|4735|4738|4740|4742|4743|4756|4767|4768|4771|4778|4779|4781|4820)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist5 = EventCode=%^(517|528|529|538|540|551|552|592|5152|5157)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist6 = EventCode=%^(624|627|628|642|644|680|6279)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist7 = EventCode=%^(7045)$%
\etc\apps\inputs_oswin_secevtlog\local\inputs.conf whitelist8 = TaskCategory=%^Network Policy Server$%
