I have a curious case that is causing real trouble in our production environment. I have deployed about 12 Windows universal forwarders (UFs) to monitor Security event logs on AD servers located behind a firewall. The UFs are currently version 5.0.2, and I pushed the input and output configurations from a deployment server.
Right after the first deployment, I could see all 12 servers sending logs just fine. After several hours, the number of servers sending logs dropped to 7. The drop continued until no server was sending logs at all.
I tried using a single server as a test and found that it only sends logs for about 3 to 4 hours at most before it stops completely. There were no errors or warnings in splunkd.log on either the forwarder or my indexer. The only splunkd.log entries were "Connected to ...." and "... phone home ....". I also did not see any blocking events in metrics.log.
My configurations are like this:
[WinEventLog://Security]
disabled = 0
index = app_ad
sourcetype = tseladscrt
start_from = oldest
current_only = 0
_TCP_ROUTING = loadheavyfwd

[tcpout:loadheavyfwd]
compressed = true
server = <indexerip>:9997
sslCertPath = D:\Program Files\SplunkUniversalForwarder\etc\auth\cert.pem
sslPassword = xxxxxxxxxxxxx
sslRootCAPath = D:\Program Files\SplunkUniversalForwarder\etc\auth\CoreCA.pem
sslVerifyServerCert = true
Where should I start to troubleshoot?
When you installed the forwarders, were the boxes checked to start the Windows event log collection, or were those inputs defined for the first time from the deployment server? The known issues for version 5.0.2 mention a problem in which a restart while an event log is being read via a [monitor://] stanza in the inputs.conf file can cause Splunk to abandon reading the file further. If the collection was in place before it was pushed from the deployment server, the deployment server could have triggered a restart that caused it to hit that issue.
The same document also mentions, in a different section, that in that version the universal forwarder can sometimes stop forwarding Windows Security and Application event logs when anti-virus software is running on the forwarder, but it doesn't give many more details. If you open a support case, there's a good chance they'd be able to tell fairly easily whether you're hitting one of those issues and what the workaround is.
I apologize for the late response. Thank you for the explanation. Would you link me to the document mentioned in your answer? Also, will upgrading the forwarder to version >= 6.3 solve the problem? I am planning to upgrade my Splunk environment and I need a justification for this one.
First, a question: why did you use _TCP_ROUTING = loadheavyfwd? It is mandatory only for selective forwarding, and your outputs.conf does not appear to use that.
Then update the forwarders, because version 5.x will soon be out of date.
After that, verify whether the Splunk internal logs arrive continuously or not (index=_internal).
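To illustrate the _TCP_ROUTING point, here is a minimal sketch of how that setting ties an input to a target group, assuming a single group named loadheavyfwd as in the question. The defaultGroup line is an assumption added for illustration, not something taken from the posted files. If a default group like this is set, the per-input _TCP_ROUTING line should no longer be needed.

```ini
# inputs.conf on the forwarder
[WinEventLog://Security]
disabled = 0
index = app_ad
sourcetype = tseladscrt
# Only needed for selective forwarding, i.e. when different inputs go to different groups
_TCP_ROUTING = loadheavyfwd

# outputs.conf on the forwarder
[tcpout]
# Assumed for illustration: send all data to this group by default
defaultGroup = loadheavyfwd

[tcpout:loadheavyfwd]
server = <indexerip>:9997
```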
I didn't include all of the indexers in this sample of outputs.conf. In reality, I have 4 indexers, and all of my forwarders point to all 4 of them.
About the Splunk internal logs: I didn't see any errors in splunkd.log, and metrics.log also showed more than 0.00 Kbps for raw events. But the data sometimes stops, and after a few hours it flows normally again for several hours.
Of course I'm planning to upgrade, but the client needs the justification that upgrading will fix this problem.
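As a rough sketch of what I mean, the target group looks something like this. The <indexer1ip> to <indexer4ip> values are placeholders, not the real addresses, and the SSL settings are the same as in the stanza I posted above.

```ini
# outputs.conf (placeholder addresses, not the real ones)
[tcpout:loadheavyfwd]
compressed = true
# The forwarder load-balances across every server listed here
server = <indexer1ip>:9997, <indexer2ip>:9997, <indexer3ip>:9997, <indexer4ip>:9997
```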
If the Splunk internal logs stop, that means there was a connection problem between the forwarder and the indexers.
No, the internal logs are being indexed just fine. Only the Security event log stops being indexed from time to time. That means the connection between forwarders and indexers is fine, right? It seems I need to check the forwarders' configurations once more.
Just a very simple test: check whether events dated 9 January 2017 were indexed with the wrong date (1 September 2017), i.e. with day and month swapped. Maybe it's a timestamp recognition error.