How do I get SLO breaches over a period of time?

msrama5 · ‎03-24-2023

Hello, I am trying to measure the downtime, or SLO breaches, of certain customer endpoints over a period of time. The metrics we measure for the endpoints are success rate and latency. Currently we query Splunk every 5 minutes to capture these values, and any success rate below 97% counts as a breach. One issue is that, within a 5-minute window, a breach could have lasted only a few seconds or minutes rather than the entire 5 minutes. If we queried Splunk every minute instead, that would mean many more query hits and storing 1440 values/day instead of 288 values/day, plus the cost of storing the data and parsing it to compute SLO breaches:
1440 mins/5 mins = 288 values
1440 mins/ 1 mins = 1440 values

Any ideas on how we can query Splunk and get threshold breaches accurate to the second, so that we can report downtime for prod incidents in terms of how long the customer impact actually lasted, with fewer hits to Splunk and more real-time impact data for the business?

ITWhisperer · ‎03-24-2023

SLOs are often defined as a percentage success within a period of time. If you have an SLO defined just as a success rate without a time period, you are probably on a hiding to nothing! If you are trying to determine the success rate for small periods of time, you have to sample your data at that level or smaller. For example, if your SLO is a success rate of 97% in a five-minute period, then you need to sample at intervals no longer than 5 minutes. If you sample more frequently than this, the samples should be aggregated up to 5 minutes so that you can compare the success rate to your SLO. If you compare the success rate for smaller time slices, e.g. every minute, they might indicate that your SLO is at risk of being breached, but since the SLO isn't defined at this level, it isn't technically a breach (until you compare the success rate for the full 5 minutes).
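To illustrate the aggregation point, here is a minimal Python sketch. It assumes the per-minute results have already been pulled out of Splunk as success and total request counts; the `MinuteSample` type, the function names, and the sample numbers are all made up for the example. Counts are summed per 5-minute window before dividing, so a busy minute weighs more than a quiet one.

```python
from dataclasses import dataclass

SLO_TARGET = 0.97      # success-rate threshold from the question
WINDOW_MINUTES = 5     # assumed SLO evaluation period


@dataclass
class MinuteSample:
    minute: int        # minutes since the start of the day
    success: int       # successful requests in that minute
    total: int         # all requests in that minute


def window_success_rates(samples, window=WINDOW_MINUTES):
    """Sum counts per window, then divide. Returns {window_start: rate}."""
    totals = {}
    for s in samples:
        start = (s.minute // window) * window
        ok, n = totals.get(start, (0, 0))
        totals[start] = (ok + s.success, n + s.total)
    # Windows with no traffic are skipped rather than counted as breaches.
    return {start: ok / n for start, (ok, n) in sorted(totals.items()) if n > 0}


def breached_windows(samples, target=SLO_TARGET, window=WINDOW_MINUTES):
    """Window start minutes whose aggregated success rate is below target."""
    rates = window_success_rates(samples, window)
    return [start for start, rate in rates.items() if rate < target]


# Hypothetical data: minute 1 dips to 90%, minute 5 dips to 80%.
samples = [
    MinuteSample(0, 100, 100), MinuteSample(1, 90, 100),
    MinuteSample(2, 100, 100), MinuteSample(3, 100, 100),
    MinuteSample(4, 100, 100), MinuteSample(5, 80, 100),
    MinuteSample(6, 100, 100), MinuteSample(7, 100, 100),
    MinuteSample(8, 100, 100), MinuteSample(9, 100, 100),
]
rates = window_success_rates(samples)
breaches = breached_windows(samples)
print(rates)
print(breaches)
```

In this made-up data, minute 1 on its own is below 97%, but the first 5-minute window as a whole is at 98%, so it is only an early indication. The second window aggregates to 96% and is the one that counts as a breach against a 5-minute SLO.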

You need to decide how important monitoring your success rate is, and at what frequency you want to measure it. This will factor into the cost of monitoring. SLOs are just part of a wider SRE approach and should be carefully considered and agreed.

To answer your question another way, if you want to measure downtime or success rate to the second, you need the data (in Splunk) to support this level of detail.
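As a rough sketch of what second-level reporting would involve once that data exists, the Python below merges consecutive seconds that fall below the threshold into intervals and totals them. The `(epoch_second, success, total)` record shape and the `breach_intervals` name are assumptions for illustration, not anything Splunk returns in this form, and seconds with no requests are treated as not breached.

```python
def breach_intervals(per_second, target=0.97):
    """Merge consecutive below-target seconds into (start, end_exclusive) pairs.

    per_second: list of (epoch_second, success, total), sorted by time.
    """
    intervals = []
    start = prev = None
    for ts, ok, n in per_second:
        bad = n > 0 and ok / n < target
        if bad and start is not None and ts == prev + 1:
            prev = ts                        # extend the open interval
        else:
            if start is not None:            # close the open interval
                intervals.append((start, prev + 1))
                start = None
            if bad:                          # open a new interval
                start = prev = ts
    if start is not None:
        intervals.append((start, prev + 1))
    return intervals


# Hypothetical data: seconds 3-5 and second 8 are at 50% success.
data = [(t, 50 if t in {3, 4, 5, 8} else 100, 100) for t in range(10)]
intervals = breach_intervals(data)
downtime_seconds = sum(end - start for start, end in intervals)
print(intervals, downtime_seconds)
```

Whether a single bad second should count as customer impact is a definition question for the SLO itself; the sketch only shows the bookkeeping.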

One thing to consider: even if you scheduled a report to execute every second, to try to pick up issues "as soon as possible", do you have the manpower to be sitting around waiting every second of every day in case a potential breach happens? What is an acceptable (and reasonable) delay between an issue arising and you being able to detect it? The smaller the delay, the more costly the solution is likely to be!

gcusello · ‎03-24-2023

Hi @msrama5 ,

you could schedule a search (e.g. every 5 minutes) that calculates downtime or SLO breaches and saves the results in a summary index, not least because I suppose these searches aren't very fast.

Then you can run a very quick search on the summary index to calculate the averages and the max values.

Ciao.

Giuseppe
