We ran into some problems with our Universal Forwarder (version 6.5.0), which collects event logs from our root DC in the testing environment.
We had several issues but have narrowed them down to one: our forwarder stops forwarding Windows Security event log data. _internal data is still coming in just fine.
We have read through many threads here and found no solution for this.
First of all, the latest inputs.conf:
[WinEventLog://Security]
disabled = 0
index = t_active_directory_60
sourcetype = windows_security
batch_size = 20
start_from = newest
evt_dc_name = xyz
evt_resolve_ad_obj = 0
checkpointInterval = 60
We tried different things here. Setting batch_size makes no difference. Setting evt_resolve_ad_obj to 1 sends no data at all (no _internal either).
Then, today, we finally got an interesting error we've never seen before:
02-09-2017 13:42:14.676 +0100 ERROR ExecProcessor - message from "C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe" splunk-winevtlog - WinEventLogChannel::queryEvtChannel: Unable to set seek position to the given bookmark
And this one keeps coming up every time we restart the forwarder:
02-09-2017 13:41:54.231 +0100 ERROR Metrics - Metric with name thruput:idxSummary already registered
Also, we saw the following warning the first time today:
02-09-2017 13:44:14.197 +0100 WARN TcpOutputProc - Pipeline data does not have indexKey. [_path] = C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe\n[_raw] = \n[_stmid] = Pv7LDc2XW3JCugFC\n[MetaData:Source] = source::WinEventLog\n[MetaData:Host] = host::XYZ\n[MetaData:Sourcetype] = sourcetype::WinEventLog\n[_done] = _done\n[_conf] = source::WinEventLog|host::XYZ|WinEventLog|\n
Does anyone have any ideas on this one?
Our outputs.conf for reference:
[tcpout]
indexAndForward = false
defaultGroup = HEAVY_FORWARDER

[tcpout:HEAVY_FORWARDER]
server = HEAVY_FORWARDER:9997
sendCookedData = true
sslPassword = ...
clientCert = C:\Program Files\SplunkUniversalForwarder\etc\auth\abc.pem
sslRootCAPath = C:\Program Files\SplunkUniversalForwarder\etc\auth\abc.pem
sslVerifyServerCert = true
useClientSSLCompression = true
useACK = true
Also, a funny side note: useACK should have no effect here. But as soon as we set useACK to false, we get duplicate Windows Security events (the same record numbers three times). Setting sendCookedData to false also results in no data being sent at all.
Any help is appreciated.
It looks to me like there is a zombie Splunk process running. I would stop Splunk in the process manager, then manually kill any Splunk processes that are still showing in Task Manager, then restart Splunk.
I have had a similar issue and found the following had to be done...
Increase the TCP input queue on the indexers.
Increase the thruput setting on the UF.
Increase the TCP output queue on the UF.
Check for any other blocked queues in your deployment.
Check the _indextime vs _time for events and make sure this is a steady number of seconds and is small.
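As a rough sketch of the UF-side output queue item above: the queue size is controlled by maxQueueSize in outputs.conf. The 10MB value here is only an illustrative assumption, not a tested recommendation, and the stanza name is taken from the outputs.conf posted in the question.

```ini
# outputs.conf on the Universal Forwarder (sketch only)
[tcpout:HEAVY_FORWARDER]
# Illustrative value, an assumption; size it for your own event volume.
# When left at "auto", the size depends on whether useACK is enabled.
maxQueueSize = 10MB
```

The forwarder needs a restart before the change takes effect.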
You will also have to make sure the DC has enough performance headroom. If the DC is virtual, look at the CPU Co-Stop value to see whether your system is really getting CPU time scheduled.
I asked our Splunk rep whether parallelIngestionPipelines would help in this case, since all of the events are coming from one source, WinEventLog://Security. No answer yet.
I wouldn't change the sourcetype on the UF, as the correct sourcetype will be set by the Windows TA on your indexer.
Can you try something like this:
[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 10
blacklist1 = EventCode="4662" Message="Object Type:\s+(?!groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:\s+(?!groupPolicyContainer)"
index = t_active_directory_60
renderXml=false
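For reference, parallelIngestionPipelines is set in the [general] stanza of server.conf. A minimal sketch follows, with the value 2 chosen purely for illustration. Whether a second pipeline helps when everything comes from a single input is exactly the open question here, so treat this as showing where the setting lives rather than as a fix.

```ini
# server.conf on the forwarder (sketch only)
[general]
# Illustrative value, an assumption; the default is a single pipeline.
parallelIngestionPipelines = 2
```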
Sorry for the late answer, and thanks for your comments so far. Yes, we are using Windows Server 2012.
We have not modified the limits.conf yet, but we will try that when we run into this issue again. Right now, we have completely uninstalled the 6.5 forwarder on the root DC and installed a 6.4 forwarder on another DC and there are no issues right now (without tuning any settings in limits.conf).
We have only modified the checkpointInterval because it was suggested in another thread. With our working installation, it is back to the standard value now.
However, thanks for the suggestions. With our root DC getting a fresh installation next week (which gets more events than the other DC), we will try to tune the settings in limits.conf if we run into those problems again.
Automatic eventlog backups should be no problem, they aren't running that often, as far as I've seen.
I will post an update next week if the problems are gone then.
Edit: And yes, the forwarder completely stops collecting eventlog data. It resends the events as soon as it gets restarted.
Is it possible the AD logs are rolling off the server before Splunk reads them fully? What is the log retention like on your test AD?
Have you tuned the thruput limits on the forwarder? Generally you need to make sure the forwarder can keep up with a busy machine. The UF defaults to 256 KBps, and on an AD server you will certainly need something higher, so raise this value in limits.conf. Maybe start with 1024?
[thruput]
maxKBps = <integer>
* If specified and not zero, this limits the speed through the thruput processor in the ingestion pipeline to the specified rate in kilobytes per second.
* To control the CPU load while indexing, use this to throttle the number of events this indexer processes to the rate (in KBps) you specify.
* Note that this limit will be applied per ingestion pipeline. For more information about multiple ingestion pipelines see parallelIngestionPipelines in the server.conf.spec file.
* With N parallel ingestion pipelines the thruput limit across all of the ingestion pipelines will be N * maxKBps.
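Put together, the suggested starting point would look something like this in limits.conf on the forwarder. Where the file goes (for example a local override directory on the UF) is an assumption about your setup, and 1024 is just the starting value mentioned above.

```ini
# limits.conf on the Universal Forwarder (sketch only)
[thruput]
# Starting value suggested above; raise it further if the forwarder still falls behind.
# Per the spec text, a value of zero means the limit is not applied.
maxKBps = 1024
```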
Also, I see you changed the default checkpoint interval, what was the idea behind that?
checkpointInterval = <integer>
* How often, in seconds, that the Windows Event Log input saves a checkpoint.
* Checkpoints store the eventID of acquired events. This lets the input continue monitoring at the correct event after a shutdown or outage.
* The default value is 5.
Also, when you say it stops, does it stop completely, or are there gaps in the collection?