Solved: Why did all of my servers stop sending logs? Configuration issue?

vincenteous · ‎08-28-2017

Hello Guys,

I have a bit of a curious case and it is really affecting our production environment. I have deployed around 12 Windows UFs to monitor Security event logs on AD servers which are located behind a firewall. The UFs are currently on version 5.0.2, and I have set the input and output configurations using a deployment server.

Right after the first deployment, I could see all 12 servers sending logs just fine. After several hours, the number of servers dropped to 7. The drop continued until no server was sending logs at all.

I tried using just a single server as a test and found that it only sends logs for about 3 - 4 hours at most before it stops sending completely. There are no errors or warnings in splunkd.log on either the forwarder or my indexer. The only splunkd.log entries were "Connected to ...." and "... phone home ....". I also did not see any blocking events in metrics.log.

My configurations are like this:
inputs.conf

[WinEventLog://Security]
disabled = 0
index = app_ad
sourcetype = tseladscrt
start_from = oldest
current_only = 0
_TCP_ROUTING = loadheavyfwd

outputs.conf

[tcpout:loadheavyfwd]
compressed = true
server = <indexerip>:9997
sslCertPath = D:\Program Files\SplunkUniversalForwarder\etc\auth\cert.pem
sslPassword = xxxxxxxxxxxxx
sslRootCAPath = D:\Program Files\SplunkUniversalForwarder\etc\auth\CoreCA.pem
sslVerifyServerCert = true

Where should I start to troubleshoot?

Thank you.
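
For anyone repeating the metrics.log check described above, here is a minimal Python sketch, not an official Splunk tool. It assumes the usual key=value layout of metrics.log lines; the sample lines, the `summarize` helper, and the choice of the `per_sourcetype_thruput` group are illustrative assumptions rather than output from this environment. It lists queues that reported `blocked=true` and adds up the `kb` forwarded for one sourcetype.

```python
import re

# Illustrative lines in the usual metrics.log key=value layout.
# These are made-up samples, not output captured from the thread.
SAMPLE = """\
08-28-2017 10:00:01.000 +0700 INFO  Metrics - group=queue, name=aggqueue, max_size_kb=1024, current_size_kb=0
08-28-2017 10:00:01.000 +0700 INFO  Metrics - group=per_sourcetype_thruput, series="tseladscrt", kbps=1.250, eps=3.1, kb=38.75, ev=96
08-28-2017 13:30:01.000 +0700 INFO  Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511
08-28-2017 13:30:01.000 +0700 INFO  Metrics - group=per_sourcetype_thruput, series="tseladscrt", kbps=0.000, eps=0.0, kb=0.0, ev=0
"""

# Matches key=value pairs; values may be quoted.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_line(line):
    """Return the key=value pairs of one metrics.log line as a dict."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def summarize(lines, sourcetype):
    """Return (names of queues that reported blocked=true, total kb for sourcetype)."""
    blocked = []
    total_kb = 0.0
    for line in lines:
        fields = parse_line(line)
        group = fields.get("group")
        if group == "queue" and fields.get("blocked") == "true":
            blocked.append(fields.get("name"))
        elif group == "per_sourcetype_thruput" and fields.get("series") == sourcetype:
            total_kb += float(fields.get("kb", 0))
    return blocked, total_kb


blocked, total_kb = summarize(SAMPLE.splitlines(), "tseladscrt")
print("blocked queues:", blocked)
print("kb forwarded for tseladscrt:", total_kb)
```

To run it against a real forwarder, pass an open file handle for the forwarder's metrics.log (normally under var/log/splunk in the install directory) instead of `SAMPLE.splitlines()`. An empty blocked list together with a sourcetype total that stops growing would point away from a full queue and toward the input itself.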

traxxasbreaker · ‎08-31-2017

When you installed the forwarders, were the boxes checked to start the Windows event log collection, or were those inputs defined for the first time from the deployment server? The known issues for version 5.0.2 mention a problem in which a restart while an event log is being read via a [monitor://] stanza in the inputs.conf file can cause Splunk to abandon reading the file further... If the collection was already in place before the configuration was pushed from the deployment server, the deployment server could have triggered a restart that caused it to hit that issue.

The same document also mentions, in a different section, that in that version the universal forwarder can sometimes stop forwarding Windows Security and Application event logs when anti-virus software is running on the forwarder, but it doesn't give many more details. If you open a support case, there's a good chance they'd be able to tell pretty easily whether you're hitting one of those and what the workaround is.


gcusello · ‎09-06-2017

Hi,
First, a question: why did you use _TCP_ROUTING = loadheavyfwd? It's mandatory only for selective forwarding, and I don't see that in your outputs.conf.
Then, update the forwarders, because version 5.x will be out of date soon.
After that, verify whether the Splunk internal logs arrive continuously or not (index=_internal).
Bye.
Giuseppe

vincenteous · ‎09-06-2017

Hi,

I didn't include all of the indexers in this sample of outputs.conf. In reality, I have 4 indexers and all of my forwarders point to all 4 of them.

As for the Splunk internal logs, I didn't see any errors in splunkd.log, and metrics.log also showed more than 0.00 Kbps for raw events. However, the data sometimes stops, and after a few hours it flows normally again for several hours.

Of course I'm planning to upgrade, but the client needs the justification that upgrading will fix this problem.

Thanks

gcusello · ‎09-06-2017

If the Splunk internal logs stopped, that means there was a connection problem between the forwarder and the indexers.
Bye.
Giuseppe

vincenteous · ‎09-07-2017

No, the internal logs are being indexed just fine. Only the Security event log stops being indexed from time to time. That means the connection between the forwarders and the indexers is fine, right? It seems I need to check the forwarders' configurations once more.

gcusello · ‎09-07-2017

Just a very simple test: check whether you have events dated 9th of January 2017 that were really generated on 1st of September 2017, i.e. indexed with the wrong date. Maybe it's a timestamp recognition error.
Bye.
Giuseppe
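
The day/month mix-up suggested above (1 September read as 9 January) can be illustrated with a short Python sketch. It is not Splunk's timestamp parser. The `dd/mm/yyyy` input string and the `both_readings` helper are hypothetical and only show how one raw timestamp can yield two different dates depending on which field is taken as the day.

```python
from datetime import datetime


def both_readings(raw, time_part="%H:%M:%S"):
    """Parse raw as day-first and as month-first; month_first is None if invalid."""
    day_first = datetime.strptime(raw, "%d/%m/%Y " + time_part)
    try:
        month_first = datetime.strptime(raw, "%m/%d/%Y " + time_part)
    except ValueError:
        # e.g. 31/08/2017 has no valid month-first reading
        month_first = None
    return day_first, month_first


# Hypothetical raw timestamp written on 1 September 2017 in day-first notation.
day_first, month_first = both_readings("01/09/2017 08:15:00")
print(day_first.date(), month_first.date())  # 2017-09-01 2017-01-09
```

If the events were being misread this way, they would still be indexed, but under a date months in the past, so a search over recent time ranges would look as if the forwarder had stopped. Searching the index over a much wider time range is a quick way to rule this in or out.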

vincenteous · ‎09-05-2017

Hello Traxx,

I apologize for the late response. Thank you for the explanation. Would you link me to the document mentioned in your answer? Also, will upgrading the forwarder to version >= 6.3 solve the problem? I am planning to upgrade my Splunk environment and I need a justification for this one.

Thanks.
