Hello! Newbie here. We are monitoring a primary metric that has several different categories (tags), each with its own unique "normal" behavior. This makes it difficult to set up a universal alerting system for when an issue occurs. The metric tracks a specific user action, and the categories represent the different types of that action. There are six distinct types, and while all of them generally show a drop in activity between 2 and 6 a.m., the extent of this drop varies significantly across the categories.
My question, since this feels like a common scenario, is: what are we missing? What else could we alert on or monitor in this situation?
Here is a breakdown of the typical behavior for each category:
Category A: Fairly consistent traffic during the day, rarely dropping to zero.
Category B: Less constant traffic during the day; it can drop to zero during off-peak hours.
Category C: Less traffic and less consistency than Category B; it can also drop to zero during off-peak hours.
Category D: Even less consistent traffic during the day; it can drop to zero during off-peak hours.
Category E: Very inconsistent traffic; it can drop to zero for longer periods, including the times outside off-peak hours.
Category F: Very inconsistent traffic, similar to Category E, that can also drop to zero for extended periods and more often.
We currently have an SLO dashboard and have set up a few different detectors to test various alerting strategies. The dashboard makes it easier to see how the traffic differs across categories.
Detector 1
Trigger: An alert is sent if the failure rate for the primary metric exceeds 5% and the total number of actions meets a minimum threshold of 30. This condition must last for 15 minutes.
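For reference, here is a minimal sketch of Detector 1's condition, assuming one-minute buckets of success/failure counts; the function name and structure are illustrative, not our actual detector configuration:

```python
from collections import deque

WINDOW_MINUTES = 15            # condition must hold for 15 consecutive minutes
FAILURE_RATE_THRESHOLD = 0.05  # 5% failure rate
MIN_ACTIONS = 30               # minimum volume so tiny samples don't trigger alerts

recent = deque(maxlen=WINDOW_MINUTES)  # rolling window of per-minute breach flags

def check_detector_1(failures: int, total: int) -> bool:
    """Feed one minute of counts; return True when the alert should fire."""
    breached = total >= MIN_ACTIONS and (failures / total) > FAILURE_RATE_THRESHOLD
    recent.append(breached)
    # Fire only if every minute in the 15-minute window breached the condition.
    return len(recent) == WINDOW_MINUTES and all(recent)
```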
Detector 2
Trigger: An alert is sent if the 20-minute moving average of the primary metric for Category A falls more than three standard deviations below its historical norm. This is based on the assumption that the data is cyclical over a one-week period.
Detector 3
Trigger: An alert is sent if the 20-minute moving average of the primary metric for Category B falls more than three standard deviations below its historical norm. This, like Detector 2, is also based on a one-week cyclical pattern.
Detector 4
Trigger: An alert is sent if the value of the primary metric for Category A drops below one for a continuous period of 45 minutes.
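A sketch of that absence condition, again with illustrative names and a per-minute feed assumed:

```python
ABSENCE_MINUTES = 45   # how long Category A must stay below 1 before alerting
quiet_minutes = 0      # consecutive minutes with fewer than 1 action

def check_detector_4(count: int) -> bool:
    """Feed one minute of Category A counts; fire after 45 quiet minutes in a row."""
    global quiet_minutes
    quiet_minutes = quiet_minutes + 1 if count < 1 else 0
    return quiet_minutes >= ABSENCE_MINUTES
```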
Detector 5
Trigger: An alert is sent if the 20-minute moving average of the overall primary metric (summed across all categories) falls more than three standard deviations below its historical norm. This detector also assumes the data follows a cyclical pattern over a one-week period.
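Detectors 2, 3, and 5 all use the same technique, just applied to different slices of the metric. Roughly, the idea looks like the sketch below, assuming we keep a few weeks of per-minute history in a time-indexed pandas Series; the function name, the four-week lookback, and the way the baseline is built are assumptions for illustration, not the detector's actual implementation:

```python
import pandas as pd

def seasonal_zscore_breach(series: pd.Series, now: pd.Timestamp, weeks: int = 4) -> bool:
    """
    series: per-minute counts for one category (or the summed metric),
            indexed by timestamp.
    Compares the current 20-minute moving average against the mean and
    standard deviation of the same 20-minute window at the same time of
    week over the previous `weeks` weeks (the one-week cycle assumption).
    """
    current = series.loc[now - pd.Timedelta(minutes=19): now].mean()

    historical = []
    for w in range(1, weeks + 1):
        t = now - pd.Timedelta(weeks=w)
        window = series.loc[t - pd.Timedelta(minutes=19): t]
        if len(window) > 0:
            historical.append(window.mean())

    baseline = pd.Series(historical)
    if len(baseline) < 2 or baseline.std() == 0:
        return False  # not enough history to judge
    # Fire when the current average sits more than 3 sigma below the norm.
    return current < baseline.mean() - 3 * baseline.std()
```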
It depends on what you are trying to achieve; solutions should be born from requirements. Are there anomalies that led to issues that you would have liked to detect sooner? Is there any correlation between the changes in the different categories around the time of the issue? You could try modelling/monitoring at a meta level, i.e. changes against historic changes by the minute, for the primary metric and/or the categorised metrics; perhaps there is value in that. Essentially, you need to do some analysis of your data to see if there are any patterns you want to try to detect.
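For instance, one way to read "changes against historic changes by the minute" is to compare the current minute-over-minute delta with the deltas seen at the same minute of previous weeks. A rough sketch, where the pandas layout, the lookback, and the names are illustrative assumptions:

```python
import pandas as pd

def delta_anomaly(series: pd.Series, now: pd.Timestamp,
                  weeks: int = 4, sigma: float = 3.0) -> bool:
    """
    series: per-minute counts for the overall metric or one category.
    Flags the current minute when its minute-over-minute change is far
    outside the changes observed at the same minute of previous weeks.
    """
    deltas = series.diff()          # minute-over-minute changes
    current_delta = deltas.loc[now]

    history = [deltas.get(now - pd.Timedelta(weeks=w)) for w in range(1, weeks + 1)]
    history = pd.Series([h for h in history if h is not None and pd.notna(h)])

    if len(history) < 2 or history.std() == 0:
        return False  # not enough comparable history
    return abs(current_delta - history.mean()) > sigma * history.std()
```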
So these are completely new metrics that were created as a result of an incident where we realized we were lacking visibility. While failures are easier to detect, my concern is with the lack of events. We are hoping to detect anomalies, but we don't know what the next incident will look like. Sure, a complete drop-off is detectable, but if it's a partial drop-off, on a metric whose categories span all different levels of drop-off, how does one do that?