Splunk Search

How to calculate the availability of an application using the number of errors per minute?

Communicator

I want to calculate the availability of an application. The logic I am using is the number of errors per minute.
So I am searching by _time and trying to get availability, but no results are returned.

"Error"    | bucket span=1m _time   |  stats  count by _time as t_err | eval avail=86400-t_err |  eval AvailPct = round((avail/86400)*100,2)| timechart span=1m sum(AvailPct)|RENAME sum(AvailPct) as "Avail.Pct"

Legend

First, I think there is a problem with your math: for each minute, you are counting the errors and then subtracting that count from the number of seconds in a day (86400).

I think you will be better off deciding what is "up" and what is "down", and then determining (by minute or second) whether the application is available. For a single time slot, availability is not a percentage; it is binary (up or down). An availability percentage only makes sense across a time frame, such as a day.

Here is an idea for the chart:

"Error"    
| bucket span=1m _time
| stats count as t_err by _time
| eval t_err=if(t_err>0,1,0)
| timechart span=1m max(t_err) as status

In this chart, the entire minute is counted as "down" if there were any errors during that minute. If you show this as a bar chart, there will be a spike on the bar for each minute where the application was "down".
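One thing to watch for: the base search only matches error events, so a minute with no errors produces no row at all rather than a 0. A minimal sketch of an alternative, assuming you want "up" minutes to appear explicitly as 0, is to let timechart do the bucketing, since timechart count reports 0 for empty buckets inside the searched time range. The field name status is just carried over from the chart above.

```
"Error"
| timechart span=1m count as t_err
| eval status = if(t_err > 0, 1, 0)
| fields _time status
```

With this form every minute in the time range gets a row, with status 1 for a "down" minute and 0 for an "up" minute.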

To calculate the availability percentage by day:

"Error"    
| bucket span=1s _time
| stats count as t_err by _time
| eval t_err=if(t_err>0,1,0)
| bucket span=1d _time
| stats sum(t_err) as totalSecsDown by _time
| eval Percent_Available = round((86400-totalSecsDown)*100/86400,2)
| timechart span=1d max(Percent_Available) as "Avail.Pct"

This calculates an availability percentage by day, based on the number of seconds down.

Note that in both cases, I defined t_err to be "1" if there are any errors. That way, when Splunk adds up t_err, it is the number of seconds (or minutes), not the number of errors.
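To see why capping t_err at 1 matters, here is a small self-contained sketch that does not need any indexed data. It uses makeresults to generate 600 synthetic "error" events, two per second, starting at midnight today. The event count and the spacing are assumptions made up for illustration; they are not taken from the original question.

```
| makeresults count=600
| streamstats count as n
| eval _time = relative_time(now(), "@d") + floor((n - 1) / 2)
| bucket span=1s _time
| stats count as t_err by _time
| eval t_err = if(t_err > 0, 1, 0)
| bucket span=1d _time
| stats sum(t_err) as totalSecsDown by _time
| eval Percent_Available = round((86400 - totalSecsDown) * 100 / 86400, 2)
```

The 600 events fall into 300 distinct seconds, so totalSecsDown should come out as 300 (seconds down), not 600 (errors), which gives a Percent_Available of 99.65 for that day.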


Motivator

As a starter, you are using "as" incorrectly in your first stats: ... | stats count by _time as t_err | ... The "as" clause has to come directly after the function whose result it names, before the "by" clause: ... | stats count as t_err by _time | ...

Also, rather than adding a separate rename at the end, I suggest you use "as" inside the timechart command itself. Like so:

"Error" | bucket span=1m _time | stats count as t_err by _time | eval avail=86400-t_err | eval AvailPct = round((avail/86400)*100,2) | timechart span=1m sum(AvailPct) as "Avail.Pct"

Communicator

Thanks. When I used the second one (which is what I am looking for), I got an error, so I modified it by adding eval. It still does not produce a chart: results are returned on the Events tab, but there is no visualization and no statistics.

"Error"
| bucket span=1s _time
| stats count by _time as t_err
| eval t_err=if(t_err>0,1,0)
| bucket span=1d _time
| stats sum(t_err) as totalSecsDown by _time
| eval Percent_Available = round((86400-totalSecsDown)*100/86400,2)
| timechart span=1d max(Percent_Available) as Avail.Pct
