Splunk Search

How to calculate the availability of an application using the number of errors per minute?

nravichandran
Communicator

I want to calculate the availability of an application. The logic I am using is based on the number of errors per minute.
So I am bucketing by _time and trying to compute availability from that, but the search returns no results.

"Error"    | bucket span=1m _time   |  stats  count by _time as t_err | eval avail=86400-t_err |  eval AvailPct = round((avail/86400)*100,2)| timechart span=1m sum(AvailPct)|RENAME sum(AvailPct) as "Avail.Pct"

lguinn2
Legend

First, I think there is a problem with your math: for each minute, you are counting the number of errors and then subtracting that count from the number of seconds in a day (86400).

I think you will be better off deciding what is "up" and what is "down", and then determining (by minute or second) if the application is available. For that time slot, availability is not a percentage, it is binary (up or down). An availability percentage only makes sense across a time frame, such as a day.

Here is an idea for the chart:

"Error"
| bucket span=1m _time
| stats count as t_err by _time
| eval t_err=if(t_err>0,1,0)
| timechart span=1m max(t_err) as status

In this chart, the entire minute is counted as "down" if there were any errors during that minute. If you show this as a bar chart, there will be a spike on the bar for each minute where the application was "down".
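
One gap worth noting: minutes with no matching events never reach stats, so they show up as empty rather than as 0. A minimal sketch of one alternative, assuming the same "Error" base search, is to let timechart do the counting (timechart count reports 0 for empty time buckets) and derive the status flag afterwards. This is a variation on the search above, not a replacement the original answer proposed.

```spl
"Error"
| timechart span=1m count as t_err
| eval status=if(t_err>0,1,0)
| fields _time status
```

Here status is 1 for any minute with at least one error and 0 for every other minute in the search time range.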

To calculate the availability percentage by day:

"Error"
| bucket span=1s _time
| stats count as t_err by _time
| eval t_err=if(t_err>0,1,0)
| bucket span=1d _time
| stats sum(t_err) as totalSecsDown by _time
| eval Percent_Available = round((86400-totalSecsDown)*100/86400,2)
| timechart span=1d max(Percent_Available) as "Avail.Pct"

This calculates an availability percentage by day, based on the number of seconds down.

Note that in both cases, I defined t_err to be "1" if there are any errors. That way, when Splunk adds up t_err, it is the number of seconds (or minutes), not the number of errors.
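
To sanity-check the daily formula, here is a small sketch that runs without any indexed data. The value 432 is a made-up number of down seconds chosen for illustration, and makeresults is used only to generate a single placeholder row to evaluate against.

```spl
| makeresults
| eval totalSecsDown=432
| eval Percent_Available = round((86400-totalSecsDown)*100/86400,2)
| table totalSecsDown Percent_Available
```

Since (86400-432)*100/86400 works out to 99.5, a day with 432 down seconds comes out as 99.5 percent available.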


aholzer
Motivator

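To start, you are using "as" incorrectly in your first stats command: ...| stats count by _time as t_err |... An AS clause has to come directly after the aggregation it names, as in ...| stats count as t_err by _time |... If you really meant to rename _time itself, you need a separate rename command: ...| stats count by _time | rename _time as t_err |...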

Also, rather than adding a separate rename at the end, I suggest you use "AS" inside the timechart command itself, like so:

"Error" | bucket span=1m _time | stats count as t_err by _time | eval avail=86400-t_err | eval AvailPct = round((avail/86400)*100,2) | timechart span=1m sum(AvailPct) as "Avail.Pct"

nravichandran
Communicator

Thanks. When I used the second one (which is what I am looking for) I got an error, so I modified it by adding eval. I still did not get a chart: results are returned under Events, but there is no visualization or statistics.

"Error"
| bucket span=1s _time
| stats count by _time as t_err
| eval t_err=if(t_err>0,1,0)
| bucket span=1d _time
| stats sum(t_err) as totalSecsDown by _time
| eval Percent_Available = round((86400-totalSecsDown)*100/86400,2)
| timechart span=1d max(Percent_Available) as Avail.Pct
