I have 40 Windows 2012 domain controllers (forwarding through heavy forwarders to the cloud) that intermittently stop sending "WinEventLog:Security" events to the cloud indexers. In some cases, a server will send Security events for a few hours and then stop sending altogether. I know the events exist on the server because I can see them in Event Viewer. I don't have the same issue with the Application or System events, which flow all the time. The issue only happens with "WinEventLog:Security" events.
So far, I have tried splitting the load among 4 heavy forwarders, thinking it was a forwarder congestion issue. I also configured the domain controllers to send directly to the cloud, bypassing the heavy forwarders. Alas, no success.
Has anyone experienced or heard about this issue? Thank you.
The reason I mentioned a delay is that you are having the problem only with WinEventLog:Security events. Since the rest are flowing fine, it cannot be a congestion or throughput problem.
If you run your search above again, do the values increase? If yes, the events are delayed; if not, they have stopped.
I think it is better for you to create a support case.
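The "run the search again and compare" check can be sketched in Python. This is only an illustration: the function name, the bucket labels, and the counts are hypothetical, and the hourly counts are assumed to have been copied by hand from two runs of the timechart search.

```python
from typing import Dict


def classify_buckets(first_run: Dict[str, int], second_run: Dict[str, int]) -> Dict[str, str]:
    """Compare per-hour event counts from two runs of the same search.

    A bucket whose count grew between runs received late events (delayed).
    A bucket that was empty and stayed empty looks stopped.
    Anything else is unchanged.
    """
    verdicts = {}
    for bucket, before in first_run.items():
        after = second_run.get(bucket, before)
        if after > before:
            verdicts[bucket] = "delayed"
        elif before == 0:
            verdicts[bucket] = "stopped"
        else:
            verdicts[bucket] = "unchanged"
    return verdicts


# Hypothetical counts for one host, taken some time apart.
first = {"09:00": 1200, "10:00": 300, "11:00": 0}
second = {"09:00": 1200, "10:00": 950, "11:00": 0}
result = classify_buckets(first, second)
print(result)
```

A bucket that stays empty can also mean the host genuinely logged nothing in that hour, so it is worth confirming in Event Viewer before treating it as stopped.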
@scelikok Unfortunately, running the search again does not increase the values. I opened a ticket on this issue a few months ago, and they recommended the changes below, with no success. Their final recommendation was to reboot the Windows servers once a week or upgrade from 2012 R2 to 2019 or newer. I will re-open the case and request more help.
1. Change evt_resolve_ad_obj from 1 to 0.
2. Increase the number of pipelines that handle incoming data to the number of CPUs on the host minus 1. In my case, that is 9 pipelines.
3. Modify the [tcpout] stanza in outputs.conf to:
autoLBFrequency = 180
forceTimebasedAutoLB = false
autoLBVolume = 5000000
You must restart the UF service on those servers.
@scelikok I'm still seeing the same issue on most hosts as you can see below. You mentioned that the events are delayed and not dropped. Is that a good assumption? Also, I'm sharing my query in case this would be helpful.
I should mention that in addition to making these changes, we spun up 3 additional HFs thinking it was a congestion issue. But, we are seeing the same behavior across those HFs as well. Your help is appreciated.
index=windows_ad source="WinEventLog:Security" host IN (host1 host2 host3) | timechart count by host span=1h limit=50
In your config, current_only is set twice, and the effective value is 1. This may cause missing events when you restart the forwarder service or the host. Please keep this as current_only = 0.
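A duplicated key like this is easy to miss by eye. Here is a minimal Python sketch that flags keys set more than once within a stanza. The function name and the sample text are hypothetical, and the parser is deliberately simple (it ignores line continuations and other .conf subtleties), so treat it as a quick sanity check rather than a faithful reimplementation of how Splunk reads its configuration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicate_keys(conf_text: str) -> Dict[Tuple[str, str], List[str]]:
    """Return {(stanza, key): [values...]} for keys set more than once in a stanza."""
    seen = defaultdict(list)
    stanza = "default"  # label for settings that appear before any stanza header
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            stanza = line[1:-1]
            continue
        if "=" not in line:
            continue  # not a key = value line
        key, value = line.split("=", 1)
        seen[(stanza, key.strip())].append(value.strip())
    return {k: v for k, v in seen.items() if len(v) > 1}


# Shortened sample modelled on the stanza posted in this thread.
sample = """
[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
checkpointInterval = 5
current_only = 1
"""

duplicates = find_duplicate_keys(sample)
for (stanza, key), values in duplicates.items():
    print(f"[{stanza}] {key} is set {len(values)} times: {values}")
```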
Please try the settings below (cache settings):
[WinEventLog://Security]
use_old_eventlog_api = true
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 5
blacklist1 = EventCode="4662" Message="Object Type:(?!\s*groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:(?!\s*groupPolicyContainer)"
renderXml = true
index = my_windows_ad
evt_ad_cache_exp = 1200
evt_ad_cache_exp_neg = 1200
evt_ad_cache_max_entries = 40000
evt_sid_cache_exp = 300
evt_sid_cache_exp_neg = 300
evt_sid_cache_max_entries = 4000
evt_dc_name = localhost
If you still have a delay you may have another problem. It is better to open a support case.
Setting evt_resolve_ad_obj = 0 will stop SID resolution. You will not be able to see usernames in the logs.
Please test with only use_old_eventlog_api = true.
@scelikok changed evt_resolve_ad_obj back to 1.
Even after changing use_old_eventlog_api to true, I still see the logs delayed/missing. I am including my stanza for this source. Let me know if you see anything that can be improved. I'm surprised this isn't a bigger deal with Splunk. I haven't seen any known-bug articles or bulletins for this issue. Thank you.
use_old_eventlog_api = true
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 5
blacklist1 = EventCode="4662" Message="Object Type:(?!\s*groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:(?!\s*groupPolicyContainer)"
renderXml = true
index = my_windows_ad
current_only = 1
@scelikok Thanks for the suggestion. I added your fix this morning, and I also changed evt_resolve_ad_obj from 1 to 0, as suggested by another Splunker. I'll check back tomorrow. Out of curiosity, does an environment of 40 domain controllers seem too large? Any other ideas on how to limit the traffic from this source?
The Splunk Universal Forwarder resolves SIDs to usernames for WinEventLog:Security logs by querying the nearest DC. If your DCs are busy, this resolution takes more time and causes delays. If you check the logs, they should be coming in, but delayed. If this is the case, you can try adding the parameter below to use the old event log API for resolution.
[WinEventLog://Security]
use_old_eventlog_api = true
@scelikok Wanted to update you on the resolution. As it turns out, editing the limits.conf file directly in the app solved my issue. Initially, it was set to the default maxKBps = 256. I set it to 0 (no limit) using the setting below. This seems to have solved the issue, and now I'm receiving all my events. The new setting in the limits.conf file is
[thruput]
maxKBps = 0
Thanks for all your help!
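To get a feel for what the default cap means in practice, here is a rough back-of-the-envelope sketch in Python. The function names and the 400 KB/s generation rate are hypothetical and not measured from this environment. The only figures taken from the thread are the default maxKBps of 256 and the new value of 0, which removes the limit.

```python
DEFAULT_MAX_KBPS = 256  # default cap mentioned in the thread


def daily_ceiling_gb(max_kbps: float) -> float:
    """Largest volume in GB (1 GB = 1024**2 KB) a forwarder can send per day under the cap."""
    return max_kbps * 86_400 / 1024**2


def backlog_kb(generated_kbps: float, max_kbps: float, hours: float) -> float:
    """Unsent data in KB after `hours` when events are generated faster than the cap.

    max_kbps == 0 is treated as "no limit", matching the setting used in the thread.
    """
    if max_kbps == 0:
        return 0.0
    return max(generated_kbps - max_kbps, 0.0) * hours * 3600


# Hypothetical busy domain controller producing 400 KB/s of event data.
print(daily_ceiling_gb(DEFAULT_MAX_KBPS))
print(backlog_kb(400, DEFAULT_MAX_KBPS, hours=3))
print(backlog_kb(400, 0, hours=3))
```

Under these assumed numbers the forwarder falls further behind every hour, which from the search side would look like events that arrive for a while and then appear to stop.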
What still confuses me is why only Security events were affected. The first thing to check should have been the "thruput" setting, but since your System events were flowing fine, we didn't consider that option.
Anyway, very nice to hear it is resolved.