Hello! Newbie here. We are monitoring a primary metric that has several categories (tags), each with its own "normal" behavior. This makes it difficult to set up a single alert that works across all of them when an issue occurs. The metric tracks a specific user action, and the categories represent the different types of that action. There are six distinct types, and while all of them generally show a drop in activity between 2 and 6 a.m., the extent of this drop varies significantly across the categories.

My question: since this feels like a common scenario, what are we missing? What else can we alert on or monitor here?

Here is a breakdown of the typical behavior for each category:

Category A: Fairly consistent traffic during the day, rarely dropping to zero.
Category B: Less constant traffic during the day; it can drop to zero during off-peak hours.
Category C: Less traffic and less consistency than Category B; it can also drop to zero during off-peak hours.
Category D: Even less consistent traffic during the day; it can drop to zero during off-peak hours.
Category E: Very inconsistent traffic; it can drop to zero for longer periods, including outside off-peak hours.
Category F: Very inconsistent traffic, similar to Category E; it can also drop to zero for extended periods, and does so more often.

We currently have an SLO dashboard and have set up a few different detectors to test various alerting strategies. The dashboard makes it easier to see how the traffic differs.

Alerting Strategies

Detector 1 Trigger: An alert is sent if the failure rate for the primary metric exceeds 5% and the total number of actions meets a minimum threshold of 30. This condition must last for 15 minutes.

Detector 2 Trigger: An alert is sent if the 20-minute moving average of the primary metric for Category A falls more than three standard deviations below its historical norm. This assumes the data is cyclical over a one-week period.

Detector 3 Trigger: An alert is sent if the 20-minute moving average of the primary metric for Category B falls more than three standard deviations below its historical norm. Like Detector 2, this assumes a one-week cyclical pattern.

Detector 4 Trigger: An alert is sent if the value of the primary metric for Category A stays below one for a continuous period of 45 minutes.

Detector 5 Trigger: An alert is sent if the 20-minute moving average of the overall primary metric (summed across all categories) falls more than three standard deviations below its historical norm. This detector also assumes a one-week cyclical pattern.
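
To make the trigger logic above concrete, here is a rough sketch in plain Python. It is not our actual detector configuration: the function names, the one-point-per-minute resolution, and the way the historical spread is computed (standard deviation of the same window's average in prior weeks) are all assumptions for illustration. The first function mirrors the Detector 1 condition; the second mirrors the comparison used by Detectors 2, 3, and 5.

```python
from statistics import mean, stdev


def failure_rate_alert(failures, totals, rate_threshold=0.05,
                       min_actions=30, duration=15):
    """Detector 1 style check (hypothetical helper).

    Fires when, for `duration` consecutive data points, the failure
    rate is above `rate_threshold` AND the total action count is at
    least `min_actions`. One data point per minute is assumed.
    """
    streak = 0
    for failed, total in zip(failures, totals):
        if total >= min_actions and failed / total > rate_threshold:
            streak += 1
            if streak >= duration:
                return True
        else:
            # Either condition failing resets the 15-minute clock.
            streak = 0
    return False


def below_historical_norm(current_window, same_window_prior_weeks,
                          num_stdevs=3):
    """Detectors 2, 3, 5 style check (hypothetical helper).

    Compares the average of the current window (e.g. the last 20
    minutes) against the averages of the same window in prior weeks.
    Needs at least two prior weeks to compute a standard deviation.
    """
    current_avg = mean(current_window)
    historical_avgs = [mean(week) for week in same_window_prior_weeks]
    baseline = mean(historical_avgs)
    spread = stdev(historical_avgs)
    return current_avg < baseline - num_stdevs * spread


# Steady category: a drop to 10 against a baseline near 100 fires.
steady_history = [[100] * 20, [110] * 20, [90] * 20]
steady_fires = below_historical_norm([10] * 20, steady_history)

# Sparse category: history is mostly zeros, so the lower bound is
# negative and a drop to zero does not fire.
sparse_history = [[0] * 20, [2] * 20, [0] * 20]
sparse_fires = below_historical_norm([0] * 20, sparse_history)
```

Under these assumptions, the sparse example shows why the same rule behaves so differently across categories: when the historical windows are mostly zeros, the baseline minus three standard deviations falls below zero, and a count that never goes negative cannot cross it.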