Hello,
I have created separate server down and server up alerts: the down alert triggers when percentile80>5, and the up alert triggers when percentile80<5.
But I want to create one combined alert that triggers every time it runs while the server is down, and sends only one up alert (recovery alert) once the server is up again. In other words, it should not trigger multiple up alerts until the server goes down again.
Is there any way to get this done?
Below is the query:
The time range is the last 15 minutes and the cron schedule is */2 * * * * (every 2 minutes).
index=xyz sourcetype=xyz host=*
| eval RespTime=time_taken/1000
| eval RespTime = round(RespTime,2)
| bucket _time span=2m
| stats avg(RespTime) as Average perc80(RespTime) as "Percentile_80" by _time
| eval Server_Status=if(Percentile_80>=5, "Server Down", "Server UP")
So the above alert should trigger when the server is down, and keep triggering every 2 minutes until it is up. It should then trigger only once when the server is up again, and not trigger every 2 minutes until the server goes down again.
One possible solution would be to use a lookup (status_lookup) to keep track of the last known state. This solution adds a host field so it can work for more than one host.
Step 1:
Create a KVStore (or file-based) lookup with the fields "host" and "Server_Status", which is the field name the search below reads and writes. (Note: the solution below will also add an alert message field, but that's more of a side effect.)
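For the file-based variant, the lookup file has to exist before the first alert run. A minimal sketch for seeding it, assuming a CSV named status_lookup.csv (a hypothetical file name) and a lookup definition named status_lookup that points at it, created separately under Settings > Lookups:

```
| makeresults
| eval host="placeholder", Server_Status="Server Up", alert=""
| table host Server_Status alert
| outputlookup status_lookup.csv
```

The placeholder row is only there so the file has the right columns. The search in Step 2 overwrites the whole lookup on each run, so the row disappears after the first execution.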
Step 2:
Add the "host" group by clause, and lookup commands to your SPL:
index=xyz sourcetype=xyz host=*
| eval RespTime=time_taken/1000
| eval RespTime = round(RespTime,2)
| bucket _time span=2m
| stats avg(RespTime) as Average perc80(RespTime) as "Percentile_80" by _time host
| eval Current_Server_Status=if(Percentile_80>=5, "Server Down", "Server Up")
| lookup status_lookup host
| eval alert=case(Current_Server_Status="Server Down", host." is down",
(Current_Server_Status="Server Up" AND Server_Status="Server Down"), host." is back up")
| rename Current_Server_Status AS Server_Status
| table host Server_Status alert
| outputlookup status_lookup
You'll end up with a search that outputs something like this (and updates the lookup for the next alert run):
+---------------+--------------+------+
| Server_Status | alert        | host |
+---------------+--------------+------+
| Server Down   | a is down    | a    |
| Server Up     | b is back up | b    |
| Server Up     |              | c    |
| Server Down   | d is down    | d    |
+---------------+--------------+------+
Note that host c has no alert message because it went from "up" to "up" with the sample data I used.
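Two details are left open above: the search returns a row for every host, including hosts that stayed up, and a 15-minute time range with 2-minute buckets produces several rows per host. A minimal sketch of one way to handle both, assuming the alert's trigger condition is "number of results is greater than 0" and that the most recent bucket is the one that should decide the status (both are my assumptions, not part of the original setup):

```
index=xyz sourcetype=xyz host=*
| eval RespTime=round(time_taken/1000, 2)
| bucket _time span=2m
| stats avg(RespTime) as Average perc80(RespTime) as Percentile_80 by _time host
| sort 0 -_time
| dedup host
| eval Current_Server_Status=if(Percentile_80>=5, "Server Down", "Server Up")
| lookup status_lookup host
| eval alert=case(Current_Server_Status="Server Down", host." is down",
    (Current_Server_Status="Server Up" AND Server_Status="Server Down"), host." is back up")
| rename Current_Server_Status AS Server_Status
| table host Server_Status alert
| outputlookup status_lookup
| where isnotnull(alert)
```

The sort and dedup keep only the latest bucket per host, so the lookup holds one row per host. The final where comes after outputlookup on purpose: the lookup still records the state of every host, while only rows with an alert message reach the trigger condition. With the sample output above, hosts a, b and d would fire and host c would not. Note that the latest bucket may still be filling when the search runs, so its percentile can be based on partial data.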