Good morning,
I have a Loss of Feeds alert that runs every 15 minutes and looks back over the last 24 hours to check when various hosts were last seen. Because some hosts are lower volume than others, I also use a lookup table listing the fifteen hosts. The lookup table defines a max delay for each host, and the alert notifies me if any host has exceeded its max delay.
What I really want is a better way to define the max delay. Is there a way I can look over a 24-hour period and see the maximum or average gap between events for a host? Once I determine that, I can fine-tune my max delay times in the lookup table and hopefully stop the constant alerts.
Thanks for any help.
Like this:
index=* earliest=-24h latest=now | delta _time AS pause | eval pause=-pause | stats avg(pause) max(pause) BY index sourcetype host
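For anyone wanting to check the arithmetic outside Splunk, the gap computation that `delta` plus `stats` performs can be sketched roughly in Python (host names and timestamps here are hypothetical, and events are assumed to be plain epoch seconds):

```python
from statistics import mean

def gap_stats(timestamps_by_host):
    """For each host, return (avg_gap, max_gap) in seconds between
    consecutive event timestamps."""
    stats = {}
    for host, times in timestamps_by_host.items():
        ts = sorted(times)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        if gaps:
            stats[host] = (mean(gaps), max(gaps))
    return stats

# Hypothetical event times (epoch seconds) for two hosts
events = {
    "web01": [0, 60, 120, 900],   # one 13-minute silence at the end
    "db01":  [0, 300, 600, 900],  # steady 5-minute cadence
}
print(gap_stats(events))  # web01's longest gap is 780 s
```

A host with a large max gap but a small average gap is exactly the kind of feed that needs a looser max delay in the lookup table.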
This is awesome. Would you recommend joining this as a subsearch to the OP's 15-minute search so that the alert is only triggered when the 15-minute result satisfies some condition when compared to the 24h search?
No, I would have 2 searches: one that continuously updates the max and avg values, and then your existing one. Have the first one dump values using |outputlookup ...
Like this:
index=_audit earliest=-15m latest=now
| delta _time AS pause
| eval pause=-pause
| stats avg(pause) as avg_15m max(pause) as max_15m BY host
| join host
[ search
index=_audit earliest=-24h latest=now
| delta _time AS pause | eval pause=-pause
| stats avg(pause) as avg_24h max(pause) as max_24h BY host]
| where avg_15m > avg_24h
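The join-and-compare step at the end of that query can be sketched in Python (hypothetical data; each stats dict maps host to an `(avg, max)` pair of pause lengths in seconds):

```python
def hosts_to_alert(stats_15m, stats_24h):
    """Return hosts whose recent average pause exceeds their 24-hour
    baseline average -- the `join` plus `where avg_15m > avg_24h` step."""
    return [host for host, (avg_15m, _max_15m) in stats_15m.items()
            if host in stats_24h and avg_15m > stats_24h[host][0]]

# Hypothetical per-host (avg, max) pause stats in seconds
recent   = {"web01": (400, 780), "db01": (250, 300)}
baseline = {"web01": (300, 900), "db01": (300, 350)}
print(hosts_to_alert(recent, baseline))  # ['web01']
```

Like the SPL `join`, a host missing from the 24-hour baseline simply drops out rather than triggering an alert.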
So I took your query and I want to make sure I'm understanding the results properly. For the column marked max_15m, does that mean over the course of 24 hours there were "X" times it hadn't talked to Splunk for more than 15 mins? If so, I have a host in my results that shows "4" for max_15m column. That means there were four times for that host where it went more than 15 minutes without talking to Splunk?
If so, I could increase the interval to 30m and see if that drops to zero. I really want to get as few alerts as possible. So if it's "normal" for a host to go four times a day for more than 15 minutes without talking to Splunk, I set the max delay time in my alert to something like 20, 30, 40, etc. mins based on the tweaking I do with your query. Am I understanding that right?
In my query, max_15m is the largest number of seconds in the last 15 minutes during which there were no events for that host. max_24h is the longest pause (in seconds) in the last 24 hours.
It occurs to me that there will often be pauses higher than the average in the last 24 hours, but never anything higher than the max, so maybe a percentile value would be a better threshold?
If you want to count the number of times where the pause was greater than the 95th percentile in the last 24 hours, you could do something like this:
index=_audit earliest=-15m latest=now
| delta _time AS pause
| eval pause=-pause
| join host
[ search
index=_audit earliest=-24h latest=now
| delta _time AS pause
| eval pause=-pause
| stats avg(pause) as avg_24h max(pause) as max_24h perc95(pause) as ptile_24h BY host]
| eval countme=if(pause>ptile_24h, 1, 0)
| stats sum(countme) by host
I'm using the _audit index because it will run anywhere, but maybe you can generalize this.
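For reference, the percentile-threshold count in that last query can be sketched in Python, using a nearest-rank percentile as a rough stand-in for Splunk's perc95() (data here is hypothetical):

```python
def count_above_percentile(recent_pauses, baseline_pauses, pct=95):
    """Count recent pauses that exceed the pct-th percentile of the
    24-hour baseline pauses (nearest-rank method)."""
    if not baseline_pauses:
        return 0
    ranked = sorted(baseline_pauses)
    idx = max(0, -(-pct * len(ranked) // 100) - 1)  # ceil(pct% of n), 0-based
    threshold = ranked[idx]
    return sum(1 for p in recent_pauses if p > threshold)

# Hypothetical pause lengths in seconds
baseline = list(range(1, 101))  # 1..100, so the 95th percentile is 95
print(count_above_percentile([90, 96, 120], baseline))  # 2
```

A nonzero count for a host suggests its feed is currently pausing longer than it does 95% of the time, which is a much less noisy trigger than comparing against the average.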