
False alert - delay in log writing?

thezen
Explorer

We are getting random false alerts from a Splunk (6.5.2) search that checks whether a certain string is missing from a log file within the last 15 minutes.

When we investigated and re-ran the search, the string was present for the alert period, so it should not have triggered any alert.
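To give an idea of the shape of the search (the index and the string here are just placeholders, not the real values), it's roughly an alert along these lines that fires whenever the search returns a result:

index=app_logs source="/XXX/systemerr.log" "EXPECTED_STRING" earliest=-15m
| stats count
| where count = 0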

We couldn't find any relevant errors in the splunkd log on the forwarder, but I did notice these two consecutive entries in metrics.log:

1/25/19 4:55:01.800 PM 01-25-2019 16:55:01.800 +1100 INFO  Metrics - group=per_source_thruput, series="/XXX/systemerr.log", kbps=10.196221, eps=0.193555, kb=316.072266, ev=6, avg_age=1389.166667, max_age=1667 
1/25/19 4:22:59.801 PM 01-25-2019 16:22:59.801 +1100 INFO  Metrics - group=per_source_thruput, series="/XXX/systemerr.log", kbps=6.268667, eps=0.161285, kb=194.334961, ev=5, avg_age=211.600000, max_age=265

We got the false alert around 4:54. If I'm reading the time gap and the avg_age values correctly, the alert may have been triggered because the data was only read after 4:55, i.e. there were no new lines added to the file between 4:22 and 4:55.

So the question is: is my understanding correct? Is the problem caused by a delay in writing the data to the source log file, or by a processing delay within Splunk itself?

Appreciate any advice.


dkeck
Influencer

Hi,

To help you understand delayed indexing, you could look at the delta between the event time and the index time:

| eval delta=_indextime-_time

If the delta value is high, you are likely experiencing a delay in log ingestion.
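For example, something along these lines would show how the lag develops over time for that source (the index name is just a placeholder):

index=your_index source="/XXX/systemerr.log" earliest=-4h
| eval delta=_indextime-_time
| timechart span=5m max(delta) AS max_lag_seconds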


thezen
Explorer

Thanks for that, I'm sure it will be useful. In my case, though, I suspect the _time value itself may not be reliable, since it's parsed from the log file content rather than taken from the time the forwarder actually read the line. So even though the delta shows a delay of about 15 minutes on the oldest records, the real cause might be that the log line itself was not written to the target log file until 15 minutes after the event actually occurred.

I guess I should rephrase the question: based on the metrics.log output, what does the time gap between the two entries shown in my original post mean? Normally there is rarely a gap of more than a minute between those lines. My best guess is that the new lines were only added after 4:55 in this case, but I'm not entirely sure.
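In case it helps anyone looking at the same thing, the gaps and ages can be charted from the per_source_thruput metrics for that source, assuming the forwarder's internal logs are forwarded to the indexer (group, series, avg_age and max_age are the standard metrics.log fields shown in the entries above):

index=_internal source=*metrics.log* group=per_source_thruput series="/XXX/systemerr.log"
| timechart span=5m max(avg_age) AS avg_age max(max_age) AS max_age count AS entries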
