Hi guys. I want to calculate downtime based on the number of requests that an application server processes. The downtime is calculated based on the following rules.
Below is an example of the result I want to calculate downtime on:
Here is my method: compute the 80th percentile of the per-minute request counts, then qualify every minute as uptime or downtime based on that value.
index=_internal source=*web* req_time=*
| bucket _time span=1m | stats count by _time
| eventstats perc80(count) AS maxperc80
| eval status=if(count < maxperc80, "down", "up")
You probably want to add some sort of count of consecutive durations and exclude the outliers.
Then do the sum of the "down" minutes.
| stats count by status
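As a rough sketch of how the two suggestions above (counting consecutive durations, then summing the "down" minutes) might fit together: the search below numbers each run of consecutive minutes that share the same status, drops "down" runs shorter than a minimum length, and sums what is left. The field names prev_status, status_changed, run_id, run_start, run_minutes and downtime_minutes, and the minimum run length of 3 minutes, are assumptions made for illustration, not part of the original search.

```
index=_internal source=*web* req_time=*
| bucket _time span=1m
| stats count by _time
| eventstats perc80(count) AS maxperc80
| eval status=if(count < maxperc80, "down", "up")
| sort 0 _time
| streamstats current=f window=1 last(status) AS prev_status
| eval status_changed=if(isnull(prev_status) OR status!=prev_status, 1, 0)
| streamstats sum(status_changed) AS run_id
| stats min(_time) AS run_start count AS run_minutes by run_id status
| where status="down" AND run_minutes>=3
| stats sum(run_minutes) AS downtime_minutes
```

One caveat: stats count by _time produces no row for a minute without any events, so a complete outage would not show up as "down" minutes here. Replacing the bucket and stats lines with timechart span=1m count is one way to get zero-filled minutes, if that matters for your data.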
... | top 20 status | stats avg(count)
Hi, one more thing. How do we add step number 2 above to the search, where we take the average of the top 20 results?
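If "the top 20 results" means the 20 busiest minutes, one possible sketch is to rank the per-minute counts, average only the first 20, and attach that average to every row so a later eval can compare against it. The field names count_rank, top20_count and avg_top20 are made up for this example, and how the average should feed into the up/down rule is not stated in this thread, so that part is left open.

```
index=_internal source=*web* req_time=*
| bucket _time span=1m
| stats count by _time
| sort 0 -count
| streamstats count AS count_rank
| eval top20_count=if(count_rank<=20, count, null())
| eventstats avg(top20_count) AS avg_top20
| sort 0 _time
```

eventstats is used instead of stats so the per-minute rows are kept; avg ignores the null values, so only the 20 highest counts contribute to avg_top20.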
I know this is not what you are asking, but based on your example, which shows an obvious 100% outage (full rather than partial), why would you not use something like this:
... | streamstats current=f latest(_time) AS prevEventTime latest(_raw) AS prevEvent | eval downtime = _time - prevEventTime | where downtime > 100
Thanks for your input. I have something similar in place already. However, point number 2 above is an important part of the search, because it is needed to calculate the downtime properly.