Lately my teammates have been searching for email logs and not finding them until several hours or even a day later.
I checked for latency issues by comparing log record times to _indextime but never found latency more than 20 minutes
index=proofpoint source=proofpoint_message_log | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | eval delay=_indextime-_time | timechart span=1m max(delay)
so ingestion seems to have happened within 20 minutes of all records. But my teammates cannot find some of the logs until hours later.
Has anyone here seen this sort of thing? In the past I've only seen this as the result of delayed ingestion. What else could cause it?
We did recently move to SmartStore. Is it possible that this is adding additional delay in Splunk determining which buckets to search?
Ever get a solution to this? I've had a report from a user that their searches for a specific event didn't show up for a couple hours, though because it was already there by the time I came online for the day I could only investigate after the fact without the ability to reproduce the problem. comparing _time to _ingesttime I get diffs of ~23s and 70s respectively and can find no reason why the results didn't come up for this user. I even checked his search history and syntax and confirmed he didn't search the wrong timeframe or use bad syntax. The searches (at least some of them) *should* have returned the logs he was looking for.
Thanks, I tried looking for min and max delays and found no negative numbers.
and then I had a recent example.
Expected two events on 11/10 early afternoon (12:30 and 14:25). Late afternoon I searched and I only found one of them. Considering possible time zone issues, I tried searching several hours into the future (until midnight, so 7.5 hours past when I expected the second event): still only found one.
After I failed to find the second event, I created an alert for the missing event. The alert never triggered.
I ran another search 11/11 early afternoon and still only found the one event
I ran another search on 11/13 and I found the missing event. It had a indextime of 11/10, within 25 minutes of when the email system said the record was created
My first thought was their timestamps are being handled incorrectly, or they're wrong in the data. This is sometimes because they're just plain wrong (Wouldn't be the first time I've seen them wrong *in the events*, but usually either that you are using the wrong field for _time, or because of timezones are off or not being recognized.
Take a concrete example - An event comes in now, but the timestamp says now+1 hour.
_indextime is set to 1605131264 and _time is set to 1605131264 + 3600 = 1605134864.
For most searches, this won't show up until an hour from now.
And at that time, the indexing "lag" will be 1605131264 - 1605134864, which is -3600, ...
So your search to find max(_indextime) will not probably ever point that one out. 0 is larger than -3600.
There are two ways to find these:
You could just do an abs() on the eval part, like this, to remove that negative sign.
... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | eval delay=abs(_indextime-_time) | timechart span=1m max(delay)
But it might be useful to see if this is the case directly, in which case you can leave off the abs() and instead just timechart both the max and the min of the delay like in the below:
... | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | eval delay=_indextime-_time | timechart span=1m max(delay) min(delay)
Then you can see both sides of that chart and see if this is indeed what's happening.
If this is what it is, then resolving it nearly always involves revisiting your timestamps and when you find what may be smoking guns in there, it might be best to create a new question to handle that topic.