Hi all,
We have recently realised that one of our application logs is missing a large number of events. We noticed this while looking for a handful of exceptions logged at WARNING level: within a given two-minute period we were expecting ~50 events, but there were none in Splunk.
Details:
Logs in /var/log/apps/appname/appname-application.log
.log is always the file being written to
Log rolls go to .log.1, then .log.2 -- for a total of 100 log files
This specific application has very spammy logs -- roughly 11 or 12 GB a day
Filters are in place via an indexer-level app -- most INFO level messages, apart from some specifics, are filtered out
WARNING level messages are purposefully not filtered away and should be present
Cannot see any queues being blocked in the metrics.log
The sourcetype for this monitor stanza is sending roughly 250 kbps of data -- this includes multiple applications on the server
Events are 45 minutes delayed before they appear in Splunk -- currently it's 5pm, but the most recent event I have is from ten past 4
Data is being throttled -- we are looking at increasing this, but we are going to test it on a non-production environment first
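For context, the throttle in question is the forwarder thruput limit in limits.conf on the server. A rough sketch of what we would change is below -- the values are illustrative, not our exact config:

[thruput]
# current forwarder limit we suspect is the bottleneck (illustrative value)
maxKBps = 256
# value we plan to trial in non-production first (also illustrative)
# maxKBps = 1024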
Does anyone have any ideas as to what could be causing events to go missing?
Thanks and regards,
Alex
So, it turns out that ignoreOlderThan works differently than I previously thought. Because a lot of the files hadn't been written to for upwards of a month, Splunk had stopped monitoring them entirely -- so even though they had been modified recently, Splunk was not picking them up.
In other words, avoid ignoreOlderThan like the plague.
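For anyone who hits the same behaviour, our monitor stanza looked roughly like the sketch below (the sourcetype name and the exact ignoreOlderThan value are illustrative, not copied from our config):

[monitor:///var/log/apps/appname/appname-application.log*]
sourcetype = appname_application
# once a file's modification time falls outside this window the monitor drops it,
# and it is NOT picked back up when the file is modified again later
ignoreOlderThan = 30d

Dropping that line, followed by a forwarder restart so the monitor re-evaluates the files, is the straightforward fix.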
As @harsmarvania57 mentioned, problems with timestamping are always to be investigated as a source for these kinds of things. Run this search for "All time" (yes, you MUST run it for "All time") and see if you can find your missing events/sources somewhere that you do not expect them. A "good" lagSecs value is in the range 100..1000, and anything <0 is a big problem:
... | eval lagSecs=(_indextime - _time) | stats avg(lagSecs) AS lagSecs BY index,sourcetype,source,host
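If that surfaces a suspicious combination, a follow-up search to isolate just the badly lagged events (same fields as above, thresholds matching the ranges I mentioned) might look like this:

... | eval lagSecs=(_indextime - _time) | where lagSecs<0 OR lagSecs>1000 | stats count BY index,sourcetype,source,host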
Can you please check your log timestamps and confirm that Splunk is recognising them correctly? We recently had the same issue: the logs contained dates in dd/mm/yyyy format, but Splunk interpreted them as mm/dd/yyyy, so events were indexed into the previous month or even further back. Try something like this: pick a unique keyword from a log file that is not showing up in Splunk and search for it over "All Time". If you find your logs there, you have a timestamp recognition problem, and to rectify it you need to set "TIME_FORMAT" in props.conf.
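As a rough sketch only (the sourcetype name and the exact format string are assumptions -- match them to your real log layout), a props.conf entry for a dd/mm/yyyy timestamp could look like:

[appname_application]
# assumes timestamps like 05/09/2018 16:10:42,123 at the start of each event
TIME_PREFIX = ^
TIME_FORMAT = %d/%m/%Y %H:%M:%S,%3N
MAX_TIMESTAMP_LOOKAHEAD = 24

This needs to go on the first full Splunk instance that parses the data, usually the indexer or a heavy forwarder.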
Double-check the filter on your indexer. If your license limits allow, consider briefly disabling the filter to see if missing events appear.
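A typical indexer-level filter of the kind described in the question looks something like the sketch below (the stanza and transform names are assumptions). If the REGEX is greedier than intended it can also swallow WARNING events, so it is worth reviewing before, or instead of, disabling it outright.

props.conf:
[appname_application]
TRANSFORMS-dropinfo = appname_drop_info

transforms.conf:
[appname_drop_info]
# route matching events to nullQueue so they are never indexed
REGEX = \sINFO\s
DEST_KEY = queue
FORMAT = nullQueue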