Server availability reports

saurabhkunte
Path Finder

Hello,

I need to prepare a server availability chart depicting "uptime / downtime" represented on a line chart.

The monitoring tool always begins counting uptime from the first observed host state of UP (as seen in the data snippet below, "CURRENT HOST STATE ....."). As soon as the monitoring tool detects that the server is not reachable, it changes the host state to DOWN;SOFT and logs a HOST ALERT classification with a timestamp. It retries a few times to see whether the host has recovered; once the check interval lapses and the host is still unreachable, it changes the host state to DOWN;HARD. The host stays in that state until the monitoring tool detects that it is available again.
I need to prepare a line chart showing how long the server was up and how long it was down. Any help in achieving this is highly appreciated.

My data looks like this:

time,"c_time",classification,"host_name","host_state","host_message"
1505278803,"09/13/2017 07:00:03","CURRENT HOST STATE","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.499ms"
1505296351,"09/13/2017 11:52:31","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505296368,"09/13/2017 11:52:48","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.486ms"
1505299437,"09/13/2017 12:43:57","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505299460,"09/13/2017 12:44:20","HOST ALERT","server1.contoso.com","DOWN;HARD","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505299608,"09/13/2017 12:46:48","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.563ms"
1505308266,"09/13/2017 15:11:06","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505308282,"09/13/2017 15:11:22","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.671ms"
1505310169,"09/13/2017 15:42:49","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505310194,"09/13/2017 15:43:14","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.537ms"
1505310474,"09/13/2017 15:47:54","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505310507,"09/13/2017 15:48:27","HOST ALERT","server1.contoso.com","DOWN;HARD","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505310550,"09/13/2017 15:49:10","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.729ms"
1505313807,"09/13/2017 16:43:27","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505313820,"09/13/2017 16:43:40","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.686ms"
1505317401,"09/13/2017 17:43:21","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505317446,"09/13/2017 17:44:06","HOST ALERT","server1.contoso.com","DOWN;HARD","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505317813,"09/13/2017 17:50:13","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.579ms"
1505328210,"09/13/2017 20:43:30","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505328278,"09/13/2017 20:44:38","HOST ALERT","server1.contoso.com","DOWN;HARD","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505328345,"09/13/2017 20:45:45","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 45.523ms"
1505331558,"09/13/2017 21:39:18","HOST ALERT","server1.contoso.com","DOWN;SOFT","CRITICAL - 10.0.0.10: rta nan, lost 100%"
1505331621,"09/13/2017 21:40:21","HOST ALERT","server1.contoso.com","UP;HARD","OK - 10.0.0.10 responds to ICMP. Packet 1, rta 1357.259ms"
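
For anyone who wants to sanity-check the numbers outside Splunk, here is a minimal Python sketch of the state logic described above. It is only an illustration, not the monitoring tool's own method: it assumes a single host, treats every DOWN event (SOFT or HARD) as part of the same outage, and closes the outage at the next UP event. The `SAMPLE` string holds the first six events from the data above, trimmed to three columns, and the helper name `find_outages` is made up for this example.

```python
import csv
import io

# First six events of the sample above, trimmed to the columns this sketch needs.
SAMPLE = """time,host_name,host_state
1505278803,server1.contoso.com,UP;HARD
1505296351,server1.contoso.com,DOWN;SOFT
1505296368,server1.contoso.com,UP;HARD
1505299437,server1.contoso.com,DOWN;SOFT
1505299460,server1.contoso.com,DOWN;HARD
1505299608,server1.contoso.com,UP;HARD
"""


def find_outages(rows):
    """Return (start, end, seconds) for each DOWN period, SOFT or HARD."""
    outages = []
    down_since = None
    for row in sorted(rows, key=lambda r: int(r["time"])):
        status = row["host_state"].split(";")[0]  # "UP" or "DOWN"
        t = int(row["time"])
        if status == "DOWN" and down_since is None:
            down_since = t  # the first DOWN event opens the outage
        elif status == "UP" and down_since is not None:
            outages.append((down_since, t, t - down_since))
            down_since = None  # recovery closes the outage
    return outages


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
outages = find_outages(rows)
for start, end, seconds in outages:
    print(start, end, seconds)
```

With these six events the sketch reports two outages, of 17 and 171 seconds. An outage that is still open at the end of the data is not reported, which may or may not be what you want for a report.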

craigbowens
New Member

I don't know what you mean by "line 4 in the rex , after the first? mark and before the . write". Please elaborate.

Sukisen1981
Champion

Well, I don't know exactly what you mean by "I need to prepare a line chart showing the time duration the server was up and the time duration the server was down".
You do realize that downtime is very small compared to uptime, and that having both on the same time axis makes the graph look very ugly? Anyway, here is the query:

| eval t=strptime(strftime(_time,"%m/%d/%Y %H:%M:%S"),"%m/%d/%Y %H:%M:%S" )
| reverse
| rex field=host_state "^(?<st>.*?);"
| streamstats current=false last(st) as prevst,last(t) as prevt
| eval downtime=if((st="UP" AND(prevst="DOWN")) OR (st="DOWN" AND(prevst="DOWN")),round((t-prevt)/60,2),0)
| eval uptime=if(downtime=0,round((t-prevt)/60,2),0)
| fieldformat _time=strftime(_time,"%m/%d/%Y %H:%M")
| table _time,uptime,downtime

===
I recommend using the multiseries chart mode with an independent Y axis for each series. The statistics output of the query will give you what you are looking for. Uptime and downtime are calculated in minutes in the above query.
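
To make the previous-event logic of the query above easier to follow, here is a rough Python sketch of the same idea: each event is compared with the one before it, and the gap between them is counted as downtime if the previous state was DOWN, otherwise as uptime. The function name `split_minutes` and the choice to report 0 for the very first event are assumptions of this sketch, not something Splunk does; the four events are taken from the top of the sample data.

```python
def split_minutes(events):
    """events: (epoch_seconds, status) tuples in chronological order.

    Returns (epoch_seconds, uptime_min, downtime_min) per event.
    """
    rows = []
    prev_t, prev_status = None, None
    for t, status in events:
        if prev_t is None:
            # Nothing precedes the first event; reporting 0 is a choice made here.
            up_min, down_min = 0.0, 0.0
        else:
            minutes = round((t - prev_t) / 60, 2)
            # The gap belongs to the state the host was in before this event.
            down_min = minutes if prev_status == "DOWN" else 0.0
            up_min = minutes if prev_status != "DOWN" else 0.0
        rows.append((t, up_min, down_min))
        prev_t, prev_status = t, status
    return rows


events = [
    (1505278803, "UP"),
    (1505296351, "DOWN"),
    (1505296368, "UP"),
    (1505299437, "DOWN"),
]
table = split_minutes(events)
for row in table:
    print(row)
```

For these events the gaps are 17548, 17, and 3069 seconds, so the rows carry 292.47 minutes of uptime, 0.28 minutes of downtime, and 51.15 minutes of uptime respectively, which shows how small the downtime values are next to the uptime values.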

Sukisen1981
Champion

For some reason the rex did not get copied properly, use the below one instead.

| eval t=strptime(strftime(_time,"%m/%d/%Y %H:%M:%S"),"%m/%d/%Y %H:%M:%S" )
| reverse
| rex field=host_state "^(?<st>.*?);"
| streamstats current=false last(st) as prevst,last(t) as prevt
| eval downtime=if((st="UP" AND(prevst="DOWN")) OR (st="DOWN" AND(prevst="DOWN")),round((t-prevt)/60,2),0)
| eval uptime=if(downtime=0,round((t-prevt)/60,2),0)
| fieldformat _time=strftime(_time,"%m/%d/%Y %H:%M")
| table _time,uptime,downtime

Sukisen1981
Champion

| eval host_message=m1+" " +m2
| eval t=strptime(strftime(_time,"%m/%d/%Y %H:%M:%S"),"%m/%d/%Y %H:%M:%S" )
| reverse
| rex field=host_state "^(?<status>.*?);"
| streamstats current=false last(status) as prevst,last(t) as prevt
| eval downtime=if((status="UP" AND(prevst="DOWN")) OR (status="DOWN" AND(prevst="DOWN")),round((t-prevt)/60,2),0)
| eval uptime=if(downtime=0,round((t-prevt)/60,2),0)
| fieldformat _time=strftime(_time,"%m/%d/%Y %H:%M")
| table _time,uptime,downtime

Sukisen1981
Champion

hmm some issue with the pasting - line 4 in the rex , after the first? mark and before the . write
