SLOs are often defined as a percentage success rate within a period of time. If you have an SLO defined just as a success rate without a time period, you are probably on a hiding to nothing!

If you are trying to determine the success rate over small periods of time, you have to sample your data at that interval or smaller. For example, if your SLO is a 97% success rate in any five-minute period, then you need to sample at least every 5 minutes. If you sample more frequently than this, those samples should be aggregated up to 5 minutes so that you can compare the success rate against your SLO. If you compare success rates for smaller time slices, e.g. every minute, they might give you an early indication that your SLO could be breached, but since the SLO isn't defined at that level, it isn't technically a breach until you compare the success rate over the full 5 minutes.

You need to decide how important monitoring your success rate is, and at what frequency you want to measure it - this will factor into the cost of monitoring. SLOs are just one part of a wider SRE approach and should be carefully considered and agreed.

To answer your question another way: if you want to measure downtime or success rate to the second, you need the data (in Splunk) to support that level of detail. One thing to consider - even if you scheduled a report to execute every second, to pick up issues "as soon as possible", do you have the manpower to sit around waiting every second of every day in case a potential breach happens? What is an acceptable (and reasonable) delay between an issue arising and you being able to detect it? The smaller the delay, the more costly the solution is likely to be!
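To make the aggregation idea concrete, here is a minimal sketch in Python (not SPL) of rolling per-minute samples up into 5-minute windows and checking each window against a 97% SLO. The sample data and the `window_success_rates` helper are made up for illustration:

```python
# Sketch: aggregate per-minute (successes, total) samples into fixed
# 5-minute windows and compare each window to the SLO threshold.
WINDOW_MINUTES = 5
SLO_SUCCESS_RATE = 0.97

# (minute, successes, total_requests) -- hypothetical per-minute samples.
# Note minute 1 dips to 95% and minute 5 to 90%, below the SLO target,
# but only the 5-minute window they fall in determines an actual breach.
samples = [
    (0, 100, 100), (1, 95, 100), (2, 99, 100), (3, 98, 100), (4, 100, 100),
    (5, 90, 100), (6, 95, 100), (7, 95, 100), (8, 100, 100), (9, 100, 100),
]

def window_success_rates(samples, window=WINDOW_MINUTES):
    """Sum successes and totals per fixed window; return
    (window_start_minute, success_rate) pairs in time order."""
    buckets = {}
    for minute, ok, total in samples:
        start = (minute // window) * window
        s, t = buckets.get(start, (0, 0))
        buckets[start] = (s + ok, t + total)
    return [(start, s / t) for start, (s, t) in sorted(buckets.items())]

for start, rate in window_success_rates(samples):
    status = "OK" if rate >= SLO_SUCCESS_RATE else "SLO BREACH"
    print(f"minutes {start}-{start + WINDOW_MINUTES - 1}: {rate:.2%} {status}")
```

In this made-up data the first window (minutes 0-4) passes at 98.4% even though minute 1 alone was at 95%, while the second window (minutes 5-9) breaches at 96% - which is the point above about per-minute slices being an indicator, not a breach.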