Getting Data In

Why is Splunk universal forwarder missing events?

oscarminassian
Path Finder

Hi all,

Have you ever seen a UF miss events? I've observed some of our UFs missing ~8 seconds of events and then picking back up halfway through an event. The gaps are creating some muddy data, and it doesn't seem to be limited to one server: I've got a list of 100 or so servers across all of our environments and their corresponding Splunk clusters.

Here's a three-line example of what Splunk is seeing in the source (/app/search/show_source?blah). I've been able to manually confirm that there is a gap, with plenty of log lines in between.

2017-12-03 22:25:37 GET /Something/Something/1 from=2017-12-02&to=2017-12-04 80 - 0.0.0.0 HTTP/1.1 - - Some.url.was.here.com.au 200 0 0 00000 000 00 - HasedKeyWasHere ServiceName -
0.0.0.0 HTTP/1.1 - - ome.url.was.here.com.au 200 0 0 000 000 0 - HasedKeyWasHere ServiceName -
202017-12-03 22:25:45 GET /Something/Something/1 from=2017-12-02&to=2017-12-04 80 - 0.0.0.0 HTTP/1.1 - - Some.url.was.here.com.au 200 0 0 00000 000 00 - HasedKeyWasHere ServiceName -

I've tried this with and without line-breaking logic in props.conf to see if it would make any difference, with no success (a rough example of what I tried is at the end of this post), which is not entirely surprising in hindsight.

It's worth mentioning that these are all IIS logs being forwarded to a six-peer-node cluster with no heavy forwarders in between.
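For reference, the line-breaking logic I experimented with was roughly along these lines (values are illustrative, not our exact production config):

# props.conf (illustrative sketch of the line-breaking settings tried)
[iis]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19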

1 Solution

oscarminassian
Path Finder

Hey Splunkers,

As @ebaileytu suggested, and with the help of David at Splunk support, the cause was found to be the Universal Forwarder version. We're in the process of rolling version 7.0.1 out to PROD. This was limited to our Windows environment, and the problem has completely disappeared in our DEV/SIT and UAT environments since the upgrade!

Much winning!

meenuvn
Explorer

@oscarminassian Did the 7.0.1 UF upgrade help with the missing-events issue in your case? My org is using 6.5.2, and we have started noticing the same issue. It would be helpful if you could confirm that the issue is resolved with 7.0.1.


ebaileytu
Communicator

How is it going? Any luck? We have confirmed with Splunk support that certain versions of the 6.5 and 6.6 UFs have issues with the tailing processor and will drop/miss events. We were able to upgrade all UFs to 6.6.3 to get past it.


woodcock
Esteemed Legend

Are you sure that the events are missing? What I have seen happen many times is that the events are there, just split in the wrong place (mid-event), so that only one half of the event matches the TIME_PREFIX and TIME_FORMAT settings; the other half gets a different timestamp and is no longer right next to its other half, so it looks missing. The problem is usually buffering or chunking in the process that is writing the log file, and the only two solutions are to index the file after it rotates (after the writer is done writing to it), or to extend the amount of time that Splunk will wait for a write session to pause before assuming it is done, by increasing the time_before_close setting in inputs.conf:
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf

time_before_close = <integer>
* Modification time delta required before the file monitor can close a file on
  EOF.
* Tells the system not to close files that have been updated in past <integer>
  seconds.
* Defaults to 3.
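
As a sketch, a monitor stanza on the forwarder that raises this setting might look like the following (the path is hypothetical; adjust it to your IIS log location):

# inputs.conf on the universal forwarder (hypothetical example)
[monitor://C:\inetpub\logs\LogFiles\W3SVC1]
sourcetype = iis
# allow 10 seconds of write inactivity before Splunk closes the file (default is 3)
time_before_close = 10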

oscarminassian
Path Finder

Hi Woodcock,

100% sure events are missing in Splunk, from multiple servers. I'm able to access the servers and verify against the raw files by searching for the missing events in Splunk. I've observed a gap of 30 seconds of missing events in our SIT environment, about 400 events, before it picks back up again like nothing happened. Not sure if time_before_close fits into this, and indexing the whole file after a day, or even after 15 minutes, is not an option in Production; there's too much monitoring and alerting riding on it.


oscarminassian
Path Finder

I bow to your supreme knowledge, Woodcock. I found some of the events that had been moved 11 hours into the next day! I've attempted to push time_before_close out to 10 seconds. Let's see what happens overnight. 🙂
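
For anyone else chasing displaced events, a search along these lines should surface events whose index time is far from their parsed timestamp (a sketch only; the one-hour threshold and index name are illustrative):

index=web earliest=0
| eval lag_seconds = _indextime - _time
| where abs(lag_seconds) > 3600
| eval indexed_at = strftime(_indextime, "%Y-%m-%d %H:%M:%S")
| table _time indexed_at host source lag_seconds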

gjanders
SplunkTrust

Since the IIS sourcetype has indexed fields, if the incoming data doesn't match the sourcetype, the data will fail to parse and will be lost.

I would temporarily test using another sourcetype that does not have indexed fields to see if the issue goes away... although missing only some events is strange. Is it possible that the log format is not 100% consistent?
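
A minimal way to try this on a single test host might look like the following (sourcetype name, index, and path are hypothetical):

# inputs.conf on one test forwarder
[monitor://C:\inetpub\logs\LogFiles\W3SVC1]
sourcetype = iis_plain_test
index = web_test

# props.conf on the indexers - plain timestamp parsing, no INDEXED_EXTRACTIONS
[iis_plain_test]
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25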


oscarminassian
Path Finder

This was one of my first thoughts. We had a Puppet change a few months ago that removed the cookie field from the IIS logs, and oh boy, the data didn't like that. That problem went away after the log file rotated. I've been able to verify that this isn't the case here and that the logging is 100% uniform across our IIS fleet.


gjanders
SplunkTrust

Thanks. Either the splunkd log file on the indexer or on the forwarder might drop some hints...
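
If your splunkd logs are indexed, a search along these lines can surface parsing complaints (the components listed are the usual suspects, not an exhaustive set):

index=_internal sourcetype=splunkd (log_level=WARN OR log_level=ERROR) (component=DateParserVerbose OR component=AggregatorMiningProcessor OR component=LineBreakingProcessor)
| stats count by host component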


oscarminassian
Path Finder

Thanks for the insights. I did some back-searching in our S3 archive; it looks like we've had this issue for a long, long time, it has just never been reported.


harsmarvania57
SplunkTrust

Hi @oscarminassian,

I suspect a parsing issue in your case. Have you tried searching All Time for those missing events, with their timestamp in your query? What sourcetype are you using for the IIS logs?

oscarminassian
Path Finder

@harsmarvania57, sure did and no luck! 😞

Yeah, these are all the iis sourcetype. I'm using the following search to separate the bad events from the good, and I'm getting lots of results.

index=web sc_status!=0
| regex sc_status!="^\d{3}$"
| regex sc_status!="^\d{4}$"
| regex _raw!="^\d{4}-\d{2}-\d{2}"
| stats count by sc_status host

Also worth mentioning that we're on 6.6.1.


harsmarvania57
SplunkTrust

Any parsing errors in splunkd.log on the indexers? And I assume that you searched index=web "2017-12-03" over All Time and didn't get any events that were ingested with the wrong date, am I right?


oscarminassian
Path Finder

No parsing errors that I can find. Initially I was unable to find any events that had come in with the wrong date/time, but I found some! They were hard to track down, and I pretty much came across them by accident.


Ratan
Observer

Dear Oscarminassian,

I am facing the same issue of missing lines.

Did you find any solution for the missing-events issue? If so, could you please share it here?

Thanks

Ratan
