At their best, detectors and the alerts they trigger notify teams when applications aren’t performing as expected and help quickly resolve errors to keep things up and running. At their worst, they cause an overwhelming amount of noise that fatigues our engineering teams, distracts from real application issues, and decreases the reliability of our products. In this post, we’ll look at ways we can skew toward the former to keep our detectors helpful, our engineers unpaged by unnecessary alerts, our applications performant, and our users happy.
Detectors set conditions that determine when to alert or notify engineering teams of application issues. These detectors monitor signals and trigger alerts when the signals meet specified rules. These rules are defined either by engineering teams or automatically by third-party observability platforms, like Splunk Observability Cloud’s AutoDetect alerts and detectors.
It can be tempting to create detectors around every possible static threshold that might signal that our system is in an undesirable state – CPU above a certain percentage, disk space low, throughput high, etc. However, creating too many detectors and setting off too many alerts doesn’t always lead to actionable outcomes and often just causes alert fatigue. Creating the right detectors can help minimize false positives and negatives, reduce alert noise, and improve troubleshooting, all while reducing downtime and fatigue.
How to create good detectors
It’s important to thoughtfully and intentionally create detectors to keep them as meaningful and helpful as possible. Detectors and the alerts they trigger should be:
- Reserved for real emergencies that require human intervention
- Actionable and tied to end-user impact
- Well-documented, ideally as code
- Continuously evaluated and fine-tuned
Let’s dig into each of these best practices.
The Google SRE book defines alerting as, “Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon.” Preferably, our system is self-healing and can autoscale, load balance, and retry to minimize human intervention. If these things can’t happen, and our users aren’t able to access our site, are experiencing slow load times, or are losing data, then our teams should be paged because there’s a real emergency that requires human intervention.
If the alert being fired doesn’t meet these criteria, it should perhaps be a chart living in a dashboard or a factor contributing to the service’s health score instead of an alert that could potentially wake engineers up in the middle of the night. For example, high throughput could lead to slower response times and a degraded user experience, but it might not (e.g. when the accompanying CPU spike simply triggers an auto-scaling event). Instead of creating a detector around high throughput itself, alerting on the metrics surrounding latency or error rates indicates an actual or looming emergency.
At the end of the day, your application exists to serve a purpose. If that purpose isn’t being served or is in danger of not being served, alerts need to fire. If your infrastructure is running extremely hot, but the application is working, you can probably wait on alerting until business hours to look at additional capacity. But if users are being impacted, that’s a real emergency.
Human intervention is expensive. It interrupts work on deliverables, distracts from other potential incidents, and can unnecessarily stress out and exhaust the humans involved – either by literally waking them from sleep or by fatigue from too many alerts. Therefore, when detectors and the alerts they trigger are created, they should define actionable states.
If a detector sets a static threshold on something like CPU and creates an alerting rule that fires when CPU spikes above 80%, the human responding to that page will go into an observability platform, see the CPU spike, and may not have much more information right off the bat. Are error rates spiking? Are response times slow? Are customers impacted? Is CPU spiking because of garbage collection or backup processing? Is any action necessary? Static thresholds like these don’t provide much context around end-user impact or troubleshooting. Instead, creating detectors and alerts around symptoms, the things that impact user experience (think Service Level Objectives), means that when an alert fires, there’s something wrong, there’s a real emergency, and a human needs to step in and take action.
Some examples of actionable symptoms include:
- Error rates exceeding an agreed-upon threshold or burning through an error budget
- Latency or page load times exceeding a Service Level Objective
- Users unable to access the site or complete key transactions
- Data loss or failed writes
Here’s a good example of a Splunk Observability Cloud detector that alerts when the error rate for the API Service is above 5% over a 5-minute time period:
Alerts would be triggered where the red triangles are shown.
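For reference, a detector like this can be expressed as a short SignalFlow program. Here’s a minimal sketch using the signalfx Terraform provider; the metric and dimension names (spans.count, sf_service, sf_error), the service name, and the notification target are illustrative assumptions rather than the exact configuration behind the chart above:

```hcl
# Sketch: alert when the API Service error rate stays above 5% for 5 minutes.
# Metric/dimension names and the notification target are illustrative.
resource "signalfx_detector" "api_error_rate" {
  name        = "API Service error rate > 5%"
  description = "Error rate for the API Service has exceeded 5% for 5 minutes"

  program_text = <<-EOF
    errors = data('spans.count', filter=filter('sf_service', 'api') and filter('sf_error', 'true'), rollup='sum').sum()
    total  = data('spans.count', filter=filter('sf_service', 'api'), rollup='sum').sum()
    error_rate = (errors / total * 100).publish('api_error_rate')
    detect(when(error_rate > 5, '5m')).publish('API error rate above 5%')
  EOF

  rule {
    detect_label  = "API error rate above 5%"
    severity      = "Critical"
    description   = "API Service error rate has been above 5% for 5 minutes"
    notifications = ["Email,api-oncall@example.com"]
  }
}
```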
Another example of a seemingly good detector is one that alerts when disk utilization percentage for a specific host goes above 80% for a specified amount of time:
However, this specific host initiates a data dump to S3 every 48 hours at 4 am EST, and this data dump causes a period of increased disk utilization. We can see in the alert settings that this alert is predicted to trigger in 48 hours, when our S3 dump will occur, and we definitely don’t want to page the engineering team that manages this host at 4 am. We could adjust the signal resolution to a longer time period if needed, or we could create a muting rule that suppresses notifications for this specific condition:
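As code, such a muting rule might look roughly like the following sketch (again using the signalfx Terraform provider); the host name, the referenced detector, and the Unix timestamps bounding the 4 am window are illustrative assumptions:

```hcl
# Sketch: suppress notifications from the disk utilization detector for this
# host during the expected S3 data dump window. Names and times are illustrative.
resource "signalfx_alert_muting_rule" "s3_dump_window" {
  description = "Mute disk utilization alerts on reports-host-01 during the scheduled S3 dump"

  # Unix timestamps (seconds) bounding the window around 4 am EST
  start_time = 1700643600
  stop_time  = 1700650800

  # Assumes a signalfx_detector.disk_utilization resource defined elsewhere
  detectors = [signalfx_detector.disk_utilization.id]

  filter {
    property       = "host"
    property_value = "reports-host-01"
  }
}
```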
The downside is that muting could hide a real incident during that window, so we might also want to create a second detector around disk utilization percentage with a slightly higher threshold (one that wouldn’t fire on the expected data-dump spike) to keep monitoring utilization during this time period, as sketched below.
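That second detector could be as simple as the same signal with a higher threshold and a longer duration; here’s a rough sketch, with the threshold, duration, and names as illustrative assumptions:

```hcl
# Sketch: a stricter, unmuted detector that still catches genuinely dangerous
# disk utilization during the data dump window.
resource "signalfx_detector" "disk_utilization_critical" {
  name = "Disk utilization critically high (reports-host-01)"

  program_text = <<-EOF
    signal = data('disk.utilization', filter=filter('host', 'reports-host-01')).publish('disk_util')
    detect(when(signal > 95, '30m')).publish('Disk utilization above 95% for 30m')
  EOF

  rule {
    detect_label  = "Disk utilization above 95% for 30m"
    severity      = "Critical"
    notifications = ["Email,infra-oncall@example.com"]
  }
}
```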
Responding to an alert triggered by a detector should not require institutional knowledge or pulling multiple humans in to resolve the incident. There will be times when this happens, but detectors should be well-documented, with runbooks providing quick and easy guidance on troubleshooting steps.
Notifying the right people and setting the right escalation policies should also be considered in the configuration of detectors and alerts. For example, it doesn’t make sense to page the entire engineering team for every incident; it does make sense to page the code owners for the service experiencing issues. It doesn’t make sense to page leadership teams if a handful of customers are experiencing latency; it does make sense to page leadership teams if an entire product is offline.
Speaking of documentation… creating detectors and alerts as code provides a source-controlled documentation trail of why, when, and by whom the detector was created. This means that any new members joining the team will have background knowledge on the detectors and alerts that they will be on-call for. It also creates an audit trail when fine-tuning and adjusting thresholds, rules, notification channels, etc.
Using Observability as Code also makes it easier to standardize detectors and alerts between services, environments, and teams. This standardization means that all members of all teams have at least a baseline knowledge of the detectors and alerts and can potentially have greater context and ability to jump in and help during an incident.
Here’s an example of creating a Splunk Observability Cloud detector using Terraform:
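The following is a representative sketch; the provider configuration, metric name, threshold, runbook URL, and notification target are illustrative assumptions, so adjust them (and your realm) to match your own environment:

```hcl
# Illustrative Terraform configuration for a Splunk Observability Cloud detector.
terraform {
  required_providers {
    signalfx = {
      source = "splunk-terraform/signalfx"
    }
  }
}

variable "sfx_auth_token" {
  type      = string
  sensitive = true # org access token with API permissions
}

provider "signalfx" {
  auth_token = var.sfx_auth_token
  api_url    = "https://api.us1.signalfx.com" # adjust for your realm
}

resource "signalfx_detector" "checkout_latency" {
  name        = "Checkout p90 latency above SLO"
  description = "p90 latency for the checkout service has exceeded its 2s SLO for 10 minutes"

  # Hypothetical latency metric reported in milliseconds
  program_text = <<-EOF
    latency = data('service.request.duration.p90', filter=filter('service', 'checkout')).publish('checkout_p90')
    detect(when(latency > 2000, '10m')).publish('Checkout p90 latency above SLO')
  EOF

  rule {
    detect_label  = "Checkout p90 latency above SLO"
    severity      = "Critical"
    description   = "Users are experiencing slow checkouts"
    runbook_url   = "https://wiki.example.com/runbooks/checkout-latency"
    tip           = "Check recent deploys and downstream payment-provider latency first"
    notifications = ["Email,checkout-oncall@example.com"]
  }
}
```

Note how the rule block also carries the documentation and routing discussed above: a runbook URL and a troubleshooting tip travel with the detector, and notifications go to the team that owns the service rather than to everyone.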
Check out our previous post to learn more about deploying observability configuration via Terraform.
Detectors and alerts should work for our teams, not against them, and they should constantly evolve with our services to improve the reliability of our systems. To keep detectors and alerts actionable and relevant, it’s important to update, fine-tune, or even delete them when they’re no longer effective or helpful. If alerts are frequently firing but require no action, this is a sign to adjust thresholds or re-evaluate the need for the detector. If the actions taken by the incident response team could be easily automated, this is a sign to adjust or remove the detector. If anyone on your team uses the word “ignore” when describing a detector or alert, this is a sign to adjust or remove the detector. If team members are unable to troubleshoot an incident effectively based on the information they receive from a detector or alert, this is a sign to iterate on the detector.
Alerting detectors should never be the norm. If a detector is alerting, action should be required. Having a board full of “business as usual” alerting detectors quickly leads to false positives, false negatives, and ultimately hurts customer experience.
Wrap up
Detectors and alerts are critical to keeping our systems up and running, but too many detectors can work against that goal. Good detectors:
- Fire only for real emergencies that require human intervention
- Alert on actionable symptoms tied to user experience
- Are well-documented, ideally as code
- Are continuously evaluated, fine-tuned, and retired when no longer helpful
These things help keep our detectors and alerts useful, our engineering teams focused, and our users happy.
Interested in creating some good detectors in Splunk Observability Cloud? Try out a free 14-day trial!