Alerting

How To Determine The Average Span Between Loss of Feeds For a Host

SplunkLunk
Path Finder

Good morning,

I have a Loss of Feeds alert that runs every 15 minutes and looks back over the last 24 hours to check when various hosts were last seen. I also use a lookup table with the fifteen hosts since some hosts are lower volume than others. The lookup table defines a max delay for each host, and the alert notifies me if any host has exceeded its max delay.
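
For reference, a minimal sketch of that kind of alert (the lookup file host_max_delay.csv and its fields host and max_delay_mins are illustrative names, not necessarily the actual ones in use):

| tstats latest(_time) AS lastTime WHERE index=* BY host
| eval delay_mins=(now()-lastTime)/60
| lookup host_max_delay.csv host OUTPUT max_delay_mins
| where delay_mins > max_delay_mins

Hosts that aren't in the lookup get a null max_delay_mins and drop out at the where, so only the listed hosts can trigger the alert.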

What I really want is to better define the max delay. Is there a way I can look over a 24-hour period and see the maximum or average gap between events for a host? Once I determine that, I can fine-tune my max delay times in the lookup table and hopefully stop the constant alerts.

Thanks for any help.


woodcock
Esteemed Legend

Like this:

index=* earliest=-24h latest=now | delta _time AS pause | stats avg(pause) max(pause) BY index sourcetype host
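
One thing to note: search results come back newest-first, so delta produces negative values here (the replies below flip the sign with eval pause=-pause), and delta has no by clause, so the gaps are measured across all events rather than per host. A per-host variant that swaps delta for streamstats, shown as a sketch:

index=* earliest=-24h latest=now
| streamstats current=f last(_time) AS next_time BY host
| eval pause=next_time-_time
| stats avg(pause) AS avg_pause max(pause) AS max_pause BY host

Here next_time is the timestamp of the next-later event from the same host, so pause comes out as a positive gap per host.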

grittonc
Contributor

This is awesome. Would you recommend joining this as a subsearch to the OP's 15-minute search so that the alert is only triggered when the 15-minute result satisfies some condition when compared to the 24h search?


woodcock
Esteemed Legend

No, I would have two searches: one that continuously updates the max and avg values, and then your existing one. Have the first one dump its values using |outputlookup ....
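
A sketch of that first scheduled search, writing per-host baselines out to a lookup (the file name host_feed_baseline.csv and the field names are illustrative):

index=* earliest=-24h latest=now
| delta _time AS pause
| eval pause=-pause
| stats avg(pause) AS avg_pause_24h max(pause) AS max_pause_24h BY host
| outputlookup host_feed_baseline.csv

The existing alert can then pull the baselines back in with | lookup host_feed_baseline.csv host OUTPUT avg_pause_24h max_pause_24h and compare the current delay against them.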


grittonc
Contributor

Like this:
index=_audit earliest=-15m latest=now
| delta _time AS pause
| eval pause=-pause
| stats avg(pause) as avg_15m max(pause) as max_15m BY host
| join host
[ search
index=_audit earliest=-24h latest=now
| delta _time AS pause | eval pause=-pause
| stats avg(pause) as avg_24h max(pause) as max_24h BY host]
| where avg_15m > avg_24h


SplunkLunk
Path Finder

So I took your query and I want to make sure I'm understanding the results properly. For the column marked max_15m, does that mean that over the course of 24 hours there were "X" times it hadn't talked to Splunk for more than 15 minutes? If so, I have a host in my results that shows "4" in the max_15m column. Does that mean there were four times that host went more than 15 minutes without talking to Splunk?

If so, I could increase the interval to 30m and see if that drops to zero. I really want to get as few alerts as possible. So if it's "normal" for a host to go more than 15 minutes without talking to Splunk four times a day, I'd set the max delay time in my alert to something like 20, 30, or 40 minutes based on the tweaking I do with your query. Am I understanding that right?


grittonc
Contributor

In my query, max_15m is the longest gap, in seconds, between events for that host in the last 15 minutes. max_24h is the longest pause (in seconds) in the last 24 hours.

It occurs to me that there will often be pauses higher than the 24-hour average, and never any higher than the 24-hour max, so maybe a percentile would make a better threshold?

If you want to count the number of times where the pause was greater than the 95th percentile in the last 24 hours, you could do something like this:

index=_audit earliest=-15m latest=now 
| delta _time AS pause 
| eval pause=-pause 
| join host 
    [ search
        index=_audit earliest=-24h latest=now 
    | delta _time AS pause 
    | eval pause=-pause 
    | stats avg(pause) as avg_24h max(pause) as max_24h perc95(pause) as ptile_24h BY host] 
| eval countme=if(pause>ptile_24h, 1, 0)
| stats sum(countme) by host

I'm using the _audit index because it will run anywhere, but maybe you can generalize this.
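
To generalize it toward the original lookup-driven alert, one option (a sketch, reusing the delta/negation pattern above and the illustrative host_max_delay.csv lookup with host and max_delay_mins fields) is to schedule a 24-hour baseline search that writes the per-host percentile out as the max delay:

index=* earliest=-24h latest=now
| delta _time AS pause
| eval pause=-pause
| stats perc95(pause) AS p95_pause BY host
| eval max_delay_mins=ceiling(p95_pause/60)
| fields host max_delay_mins
| outputlookup host_max_delay.csv

That way the thresholds refresh themselves from observed behavior, and perc95 can be swapped for perc99 or a padded max if the alerts are still too noisy.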
