Hi all,
Have you ever seen a UF miss events? I've observed some of our UFs missing ~8 seconds of events and then picking up halfway through an event when they resume. The gaps are creating some muddy data, and it doesn't seem to be limited to one server: I've got a list of 100 or so hosts across all of our environments and corresponding Splunk clusters.
Here's a three-line example of what Splunk is seeing in the source (/app/search/show_source?blah). I've been able to manually confirm that there is a gap, with plenty of logs in between.
2017-12-03 22:25:37 GET /Something/Something/1 from=2017-12-02&to=2017-12-04 80 - 0.0.0.0 HTTP/1.1 - - Some.url.was.here.com.au 200 0 0 00000 000 00 - HasedKeyWasHere ServiceName -
0.0.0.0 HTTP/1.1 - - ome.url.was.here.com.au 200 0 0 000 000 0 - HasedKeyWasHere ServiceName -
202017-12-03 22:25:45 GET /Something/Something/1 from=2017-12-02&to=2017-12-04 80 - 0.0.0.0 HTTP/1.1 - - Some.url.was.here.com.au 200 0 0 00000 000 00 - HasedKeyWasHere ServiceName -
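One way to confirm splits like the one above at the source is to scan the raw IIS log for lines that don't begin with a timestamp. This is just a rough sketch of my own (the helper name and the skipping of `#` directive lines are my assumptions), not anything Splunk ships:

```python
import re

# Leading IIS W3C timestamp, e.g. "2017-12-03 22:25:37 "
TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")

def find_split_lines(path):
    """Return (line_number, text) for lines lacking a leading timestamp.

    IIS directive lines starting with '#' are skipped; anything else
    without a timestamp is a candidate mid-event split.
    """
    bad = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for n, line in enumerate(f, start=1):
            stripped = line.strip()
            if stripped and not stripped.startswith("#") and not TS.match(line):
                bad.append((n, stripped))
    return bad
```

Running this against the same file Splunk is monitoring should flag lines like the second one above, which starts mid-event.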
I've tried this with and without line-breaking logic in props.conf to see if it would make any difference, with no success. Which is not entirely surprising in hindsight.
It's worth mentioning that these are all IIS logs being forwarded to a six-peer indexer cluster with no heavy forwarders in between.
Hey Splunkers,
As @ebaileytu suggested, and with help from David at Splunk support, it was found to be the Universal Forwarder version. We're in the process of rolling version 7.0.1 out to PROD. This was limited to our Windows environment, and the problem has completely disappeared in our DEV/SIT and UAT environments since the upgrade!
Much winning!
@oscarminassian Did the 7.0.1 UF upgrade help with the missing-events issue in your case? My org is using 6.5.2 and we started seeing the same issue. It would be helpful if you could confirm that the issue is resolved with 7.0.1.
How is it going? Any luck? We have confirmed with Splunk support that certain versions of the 6.5 and 6.6 UFs have issues with the tailing processor and will drop/miss events. We were able to upgrade all UFs to 6.6.3 to get past it.
Are you sure that the events are actually missing? What I have seen happen many times is that the events are there, just split in the wrong place (mid-event), so that only one half of the event matches the TIME_PREFIX and TIME_FORMAT settings; the other half gets a different timestamp and is no longer right next to its partner, so it looks missing. The problem is usually buffering or chunking in the process that is writing the logfile, and the only two solutions are to index the file after it rotates (once the writer is done with it), or to extend the amount of time that Splunk will wait for a write session to pause before assuming it is done, by increasing the time_before_close setting in inputs.conf:
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf
time_before_close = <integer>
* Modification time delta required before the file monitor can close a file on EOF.
* Tells the system not to close files that have been updated in past <integer> seconds.
* Defaults to 3.
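For what it's worth, a bump to that setting would look roughly like this on the forwarder. The monitor path and sourcetype here are placeholders, not the poster's actual config:

```
# inputs.conf on the Universal Forwarder -- example monitor stanza
[monitor://C:\inetpub\logs\LogFiles\W3SVC1]
sourcetype = iis
# Wait 10 seconds (default 3) of write inactivity before closing the
# file on EOF, to tolerate buffered/chunked writes by IIS.
time_before_close = 10
```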
Hi Woodcock,
100% sure events are missing in Splunk from multiple servers. I'm able to access the servers and verify against the raw files by searching for the missing events in Splunk. I've observed a gap of 30 seconds with missing events in our SIT environment, about 400 events, before it picks back up again like nothing happened. Not sure if time_before_close fits into this, and indexing the whole file after a day, or even after 15 minutes, is not an option in Production: too much monitoring and alerting relies on these logs.
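A quick way to quantify gaps like that 30-second one directly from the raw file is to parse the leading timestamps and flag large jumps. This is a rough sketch of my own (the function name and the 8-second default threshold are my assumptions), just to cross-check the raw file against what Splunk shows:

```python
from datetime import datetime

def find_gaps(lines, threshold_seconds=8.0):
    """Return (start, end) timestamp pairs where consecutive IIS log
    lines are further apart than threshold_seconds."""
    fmt = "%Y-%m-%d %H:%M:%S"
    prev = None
    gaps = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], fmt)
        except ValueError:
            continue  # directive lines or mid-event splits with no timestamp
        if prev is not None and (ts - prev).total_seconds() > threshold_seconds:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

If the raw file shows no gap where Splunk does, the events were lost between the forwarder and the index rather than at the source.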
I bow to your supreme knowledge, Woodcock. I found some of the events that had been moved 11 hours into the next day! I've pushed time_before_close out to 10 seconds. Let's see what happens overnight. 🙂
Since the IIS sourcetype has indexed fields, if the incoming data doesn't match the sourcetype the data will fail to parse and will be lost.
I would temporarily test with another sourcetype that does not have indexed fields to see if the issue goes away... although missing only some events is strange. Is it possible that the log format is not 100% consistent?
This was one of my first thoughts. We had a Puppet change a few months ago that removed the cookie from the IIS logs, and oh boy, the data didn't like that; it went away after the log file rotated. I've since verified that this isn't the case here, and the logging is 100% uniform across our IIS fleet.
Thanks. Either the splunkd log file on the indexer or on the forwarder might drop some hints...
Thanks for the insights. I did some back-searching in our S3 archive, and it looks like we've had this issue for a long, long time; it's just never been reported.
Hi @oscarminassian,
I suspect a parsing issue in your case. Have you tried searching All Time for those missing events, with the timestamp in your query? What sourcetype are you using for the IIS logs?
@harsmarvania57, sure did and no luck! 😞
Yeah, these all use the IIS sourcetype. I'm using the following search to separate the bad from the good, and I'm getting lots of results.
index=web sc_status!=0
| regex sc_status!="^\d{3}$"
| regex sc_status!="^\d{4}$"
| regex _raw!="^\d{4}-\d{2}-\d{2}"
| stats count by sc_status host
Also worth mentioning that we're on 6.6.1.
Any parsing errors in splunkd.log on the indexers? And I assume you searched index=web "2017-12-03" over All Time and didn't get any events that were ingested on the wrong date, am I right?
No parsing errors that I can find. Initially I was unable to find any events that had come in with the wrong date/time, but I found some! It was hard to track down, and I pretty much came across it by accident.
Dear oscarminassian,
I am facing the same issue of missing lines.
Did you find any solution for the missing-events issue? If yes, could you please share it here.
Thanks
Ratan