Splunk Enterprise

what besides a delay in indexing can result in finding events late?

MonkeyK
Builder

Lately my teammates have been searching for email logs and not finding them until several hours or even a day later.  

I checked for latency issues by comparing log record times to _indextime but never found latency more than 20 minutes 

 

index=proofpoint source=proofpoint_message_log 
| eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") 
| eval delay=_indextime-_time | timechart span=1m max(delay)

 

 

so ingestion seems to have happened within 20 minutes of all records.  But my teammates cannot find some of the logs until hours later.

 

Has anyone here seen this sort of thing?  In the past I've only seen this as the result of delayed ingestion.  What else could cause it?

We did recently move to SmartStore.  Is it possible that this is adding additional delay in Splunk determining which buckets to search?

 

Labels (2)
0 Karma

MonkeyK
Builder

Thanks, I tried looking for min and max delays and found no negative numbers.

 

and then I had a recent example.  

Expected two events on 11/10 early afternoon (12:30 and 14:25).  Late afternoon I searched and I only found one of them.  Considering possible time zone issues, I tried searching several hours into the future (until midnight, so 7.5 hours past when I expected the second event): still only found one. 

After I failed to find the second event, I created an alert for the missing event. The alert never triggered.

I ran another search 11/11 early afternoon and still only found the one event

I ran another search on 11/13 and I found the missing event.  It had a indextime of 11/10, within 25 minutes of when the email system said the record was created

0 Karma

Richfez
SplunkTrust
SplunkTrust

My first thought was their timestamps are being handled incorrectly, or they're wrong in the data.  This is sometimes because they're just plain wrong (Wouldn't be the first time I've seen them wrong *in the events*, but usually either that you are using the wrong field for _time, or because of timezones are off or not being recognized.

Take a concrete example -  An event comes in now, but the timestamp says now+1 hour.

_indextime is set to 1605131264  and _time is set to 1605131264 + 3600 = 1605134864.

For most searches, this won't show up until an hour from now.

And at that time, the indexing "lag" will be 1605131264 - 1605134864, which is -3600, ...

So your search to find max(_indextime) will not probably ever point that one out.  0 is larger than -3600.

There are two ways to find these:

You could just do an abs() on the eval part, like this, to remove that negative sign.

...
| eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") 
| eval delay=abs(_indextime-_time) | timechart span=1m max(delay)

But it might be useful to see if this is the case directly, in which case you can leave off the abs() and instead just timechart both the max and the min of the delay like in the below:

...
| eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") 
| eval delay=_indextime-_time | timechart span=1m max(delay) min(delay)

Then you can see both sides of that chart and see if this is indeed what's happening.

If this is what it is, then resolving it nearly always involves revisiting your timestamps and when you find what may be smoking guns in there, it might be best to create a new question to handle that topic.

Happy Splunking!

-Rich

 

0 Karma
Take the 2021 Splunk Career Survey

Help us learn about how Splunk has
impacted your career by taking the 2021 Splunk Career Survey.

Earn $50 in Amazon cash!