Hi,
I need to monitor "host failure events" per hour over the last 24 hours for a group of 50 hosts. When the total for any host reaches a threshold such as 10 failures, an alert email needs to be sent. This counting and totaling needs to happen every hour.
What I want to do is schedule a report that counts the failures for each host per hour, saves the count, and then adds the next hourly count to the previous one. When any host reaches 10 failures within the 24-hour window, the triggered action needs to send an email.
At midnight, I would like to reset the counts.
Any advice appreciated.
Thank you
index=foo "failed"
| stats min(_time) as _time count by host
| eval _time=strftime(_time,"%F %H%M")
| outputlookup append=t Failed_Count
It's better to add _time and use outputlookup with append=true.
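Assuming that search runs as an hourly scheduled report, you would restrict it to the previous hour. A minimal sketch of that scheduling assumption (the earliest/latest values are mine, not required by the method):
index=foo "failed" earliest=-1h@h latest=@h
| stats min(_time) as _time count by host
| eval _time=strftime(_time, "%F %H%M")
| outputlookup append=t Failed_Count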
For alerting:
| inputlookup Failed_Count
| where strptime(_time, "%F %H%M") > relative_time(now(),"-1d")
| stats sum(count) as total by host
| where total > 10
If event count > 0, fire alert.
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Outputlookup
index=foo "failed"
| stats min(_time) as _time count by host
| eval _time=strftime(_time,"%F %H%M")
| outputlookup append=t Failed_Count
It's better to add _time
and use outputlookup
with append=true
.
For alerting:
| inputlookup Failed_Count
| where strptime(_time, "%F %H%M") > relative_time(now(),"-1d")
| stats sum(count) as total by host
| where total > 10
If event count > 0, fire alert.
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Outputlookup
WOW that is awesome!!!
I was going round and round not quite getting it... but that is exactly what I was trying to do...
although the system admin said that my default query would work as well, running every hour and sending an alert when the result count is greater than 1:
index=<foo> earliest=-24h@h latest=@h "<some bad failure msg>"
| bin _time span=1h
| stats count by host _time
| eventstats sum(count) as totalCount by host
| where totalCount > 10
One follow-up question: if I keep your outputlookup method running, how do I purge the old data after a day or so? The file might grow to a huge size and cause issues (I am thinking...).
Thank you very much !!!
https://splunkbase.splunk.com/app/1724/
or delete old rows with a script,
or run another scheduled search that checks the lookup and removes the extra rows (see the sketch below).
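As a minimal sketch, assuming the lookup keeps the _time format used above, a search scheduled daily at midnight (or hourly) could rewrite the lookup so only the last 24 hours of rows survive:
| inputlookup Failed_Count
| where strptime(_time, "%F %H%M") > relative_time(now(), "-1d")
| outputlookup Failed_Count
Because that final outputlookup has no append=t, it overwrites the file, which keeps it from growing.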
Thank you! Please convert the previous comment to an answer and I will accept it.
I was thinking about using a summary index with a 24-hour look-back each hour, but someone mentioned using an outputlookup instead...
So I am trying the outputlookup method...
I created a lookup called "Failed_Count" backed by a .csv file that contains two fields: host,count.
I can run a query like this:
index=foo "failed" | stats count by host | outputlookup Failed_Count
and it updates the lookup, but I have had no luck adding the previous hour's count to the total...
Any ideas?
...| table the fields from inputlookup, add the results from the current search to the table, then | outputlookup...
I am guessing.
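A rough sketch of that idea, using the index and search string from my query above and the two fields host,count in the lookup, might look something like this:
index=foo "failed"
| stats count by host
| append [| inputlookup Failed_Count]
| stats sum(count) as count by host
| outputlookup Failed_Count
That would add the previous totals from the lookup to the current hour's counts and write the new totals back, though it only keeps one running total per host; keeping per-hour rows with _time and outputlookup append=t would make the 24-hour window easier to handle.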