Server Down and Up Alert

Chirag812
Explorer

Hello,

I have created separate server-down and server-up alerts. The down alert triggers when percentile80 > 5, and the up alert triggers when percentile80 < 5.

Instead, I want to create one combined alert that triggers on every run while the server is down, and sends only one up alert (a recovery alert) once the server is up again. In other words, it should not send repeated up alerts until the server goes down again.

Is there any way to get this done?

Below is the query.

The time range is the last 15 minutes and the cron schedule is */2 * * * * (every 2 minutes):

index=xyz sourcetype=xyz host=*
| eval RespTime=time_taken/1000
| eval RespTime = round(RespTime,2)
| bucket _time span=2m
| stats avg(RespTime) as Average perc80(RespTime) as "Percentile_80" by _time
| eval Server_Status=if(Percentile_80>=5, "Server Down", "Server UP")


So the alert above should trigger when the server is down and keep triggering every 2 minutes until it is up. It should then trigger only once when the server is up again, and should not trigger every 2 minutes while the server stays up.


P_vandereerden
Splunk Employee

One possible solution would be to use a lookup (status_lookup) to keep track of the last known state.  This solution adds a host field so it can work for more than one host.

Step 1:
Create a KV Store (or file-based) lookup with the fields "host" and "Server_Status", which are the field names the search in Step 2 writes back. (Note: the solution below will also add an alert message field to the lookup, but that's more of a side effect.)

Step 2:
Add the "host" group-by clause and the lookup commands to your SPL. Keep only the most recent 2-minute bucket per host so the lookup holds one row per host, and read the stored status under a separate name so it is not confused with the current one:

index=xyz sourcetype=xyz host=*
| eval RespTime=time_taken/1000
| eval RespTime = round(RespTime,2)
| bucket _time span=2m
| stats avg(RespTime) as Average perc80(RespTime) as "Percentile_80" by _time host
| dedup host sortby -_time
| eval Current_Server_Status=if(Percentile_80>=5, "Server Down", "Server Up")
| lookup status_lookup host OUTPUT Server_Status AS Previous_Server_Status
| eval alert=case(Current_Server_Status="Server Down", host." is down",
                 (Current_Server_Status="Server Up" AND Previous_Server_Status="Server Down"), host." is back up")
| rename Current_Server_Status AS Server_Status
| table host Server_Status alert
| outputlookup status_lookup


You'll end up with a search that outputs something like this (and updates the lookup for the next alert run):

+------+---------------+--------------+
| host | Server_Status | alert        |
+------+---------------+--------------+
| a    | Server Down   | a is down    |
| b    | Server Up     | b is back up |
| c    | Server Up     |              |
| d    | Server Down   | d is down    |
+------+---------------+--------------+

Note that host c has no alert message because it went from "up" to "up" with the sample data I used.
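
The transition logic can be tried on its own before touching the lookup. The following is a minimal sketch that builds four made-up rows with makeresults. The Previous_Server_Status field name and its values are assumptions chosen to reproduce the table above; they stand in for whatever the lookup returned from the previous run. No lookup is read or written.

```spl
| makeresults count=4
| streamstats count AS n
| eval host=case(n=1,"a", n=2,"b", n=3,"c", n=4,"d")
| eval Previous_Server_Status=case(n=1,"Server Up", n=2,"Server Down", n=3,"Server Up", n=4,"Server Down")
| eval Current_Server_Status=case(n=1,"Server Down", n=2,"Server Up", n=3,"Server Up", n=4,"Server Down")
| eval alert=case(Current_Server_Status="Server Down", host." is down",
                  Current_Server_Status="Server Up" AND Previous_Server_Status="Server Down", host." is back up")
| table host Previous_Server_Status Current_Server_Status alert
```

Hosts a and d get a down message, b gets a single recovery message, and c gets none because case() returns null when no condition matches. To fire the alert only when a message is present, one option is a custom trigger condition such as "search alert=*". That leaves the outputlookup in the scheduled search untouched, so every host's state is still saved for the next run.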

Paul van der Eerden,
Breaking software for over 20 years.