
Detector Best Practices: Static Thresholds

By CaitlinHalla, Splunk Employee

Introduction

In observability monitoring, static thresholds are used to monitor fixed, known values within application environments. If a signal goes above or below a static threshold or is within or outside a specified range, an alert will fire. Static thresholds are quick to configure and can provide helpful insight into system stats, but there are downsides to using them too frequently. There’s a time and a place for static thresholds, and in this post, we’ll look at when to use static thresholds, when not to use them, and alternatives to static thresholds. 

When to use static thresholds

Static thresholds work well for situations where there are “known knowns” – the predictable cases you can anticipate – and where there’s a fixed range of “good” and “bad” values. Here’s an example of a CPU utilization detector with an alert rule that triggers when CPU is above 90% for a 5-minute duration: 

[Screenshot: CPU utilization detector alert rule that triggers when CPU is above 90% for 5 minutes]

As a side note, adding durations on such alert rules is important to avoid over-alerting on transient spikes. Without a set duration, every CPU spike over 90% would trigger an alert, as we can see when we try to configure the same condition without a duration: 

[Screenshot: the same CPU utilization alert rule configured without a duration]

The estimated alert count for this alert rule is 11 alerts in 1 hour – aka too much alert noise. 
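
To see why the duration matters, here’s a minimal Python sketch (purely illustrative – not SignalFlow, and not how Splunk Observability Cloud actually evaluates detectors) that runs both rules over the same hour of hypothetical one-minute CPU samples:

```python
def alerts_without_duration(samples, threshold=90):
    """Fire once for every datapoint above the threshold."""
    return sum(1 for cpu in samples if cpu > threshold)


def alerts_with_duration(samples, threshold=90, duration=5):
    """Fire only when the signal stays above the threshold for
    `duration` consecutive datapoints (e.g. five 1-minute samples)."""
    alerts, streak = 0, 0
    for cpu in samples:
        streak = streak + 1 if cpu > threshold else 0
        if streak == duration:   # condition held for the full duration
            alerts += 1
            streak = 0           # one sustained breach = one alert
    return alerts


# One hour of hypothetical 1-minute CPU samples: a few transient spikes
# plus one sustained breach.
cpu = [40, 95, 50, 92, 45, 91, 60, 97, 55] + [93] * 6 + [50] * 45

print(alerts_without_duration(cpu))  # 10 noisy alerts from transient spikes
print(alerts_with_duration(cpu))     # 1 alert, only for the sustained breach
```

The duration requirement absorbs the transient spikes, so only the breach that actually persists pages anyone.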

Boolean conditions, like monitoring synthetic test failures, are a great case for static thresholds. Alerting when a synthetic check fails indicates issues with site availability that should be addressed immediately. This detector alerts on a synthetic test when uptime drops below the 90% static threshold: 

[Screenshot: synthetic browser test detector that alerts when uptime drops below 90%]

You can also use static thresholds to monitor Service Level Objectives/Service Level Agreements/Service Level Indicators (SLO/SLA/SLI). If you have an SLO of 99.9% uptime, you’ll want to be alerted anytime your availability approaches that threshold. Here’s an SLO detector that alerts if latency passes the 99.99999% threshold: 

[Screenshot: SLO detector alert rule]

While configuring static thresholds on SLOs is possible, within Splunk Observability Cloud the recommended approach is to manage SLO alerts using error budgets. 
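
For context, an error budget is simply the amount of unreliability your SLO allows over a compliance window. Here’s a hedged sketch of the arithmetic (plain Python with hypothetical numbers; Splunk Observability Cloud tracks this for you once an SLO is defined):

```python
# Illustrative error-budget math for a 99.9% availability SLO over a
# 30-day compliance window (hypothetical numbers).
slo_target = 0.999
period_minutes = 30 * 24 * 60                    # 43,200 minutes in the window

error_budget_minutes = (1 - slo_target) * period_minutes
print(round(error_budget_minutes, 1))            # 43.2 minutes of allowed downtime

downtime_so_far = 20                             # hypothetical minutes of downtime
budget_consumed = downtime_so_far / error_budget_minutes
print(f"{budget_consumed:.0%} of the error budget consumed")   # roughly 46%
```

Alerting on how fast that budget burns, rather than on a single static line, tells you whether you’re actually on track to miss the SLO.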

Static thresholds are most appropriate for fixed, critical metrics with clear failure conditions: HTTP error response codes spiking, response time increasing for a sustained period, or error rates staying above a set percentage for a set amount of time. If you’re just starting out on your observability journey, static thresholds are also a great way to capture baseline metrics and gain insight into trends so you can fine-tune and adjust your detectors and alerts. 
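
As a concrete version of the error-rate case, the sketch below (illustrative Python with made-up per-minute counters) derives the error-rate signal from two request counters before applying the static threshold; a real detector would also attach a duration before firing, exactly as in the CPU example above:

```python
# Hypothetical per-minute counters from a service's request metrics.
requests_per_min = [1200, 1150, 1300, 1250, 1100, 1400]
errors_per_min   = [  12,   10,   90,  100,   95,    9]

ERROR_RATE_THRESHOLD = 5.0   # percent

# Derive the error-rate signal from the two counters...
error_rate = [100 * errs / reqs for errs, reqs in zip(errors_per_min, requests_per_min)]

# ...then apply the static threshold to the derived signal.
breaches = [minute for minute, rate in enumerate(error_rate) if rate > ERROR_RATE_THRESHOLD]
print(breaches)   # [2, 3, 4]: a sustained breach worth alerting on
```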

When not to use static thresholds

As we saw in the CPU detector above, alerting on static thresholds can lead to a lot of alert noise if not used correctly. To create good detectors, we need to alert on actionable signals based on symptoms, not causes (we have a whole post on How to Create Good Detectors). 

In dynamic system environments that include autoscaling and fluctuating traffic, static thresholds might not indicate actual system failures. Setting static thresholds on pod CPU in a Kubernetes environment, for example, could indicate increased load, but might not indicate a problem – pods could autoscale to handle the increase just fine. 

Note: monitoring the CPU static threshold combined with pod queue length could provide the additional context needed to create an actionable detector. 
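
Here’s a rough sketch of that compound condition (plain Python with made-up values, using pending pod count as a stand-in for queue length; it isn’t a Kubernetes or Splunk Observability Cloud integration):

```python
def cpu_alone(cpu_pct, threshold=90):
    # High CPU by itself may just mean the pods are busy and will autoscale.
    return cpu_pct > threshold


def actionable(cpu_pct, pending_pods, cpu_threshold=90):
    # High CPU *and* pods stuck pending suggests the cluster isn't absorbing
    # the load, which is a symptom worth alerting on.
    return cpu_pct > cpu_threshold and pending_pods > 0


print(cpu_alone(95))                    # True, but possibly fine (autoscaling)
print(actionable(95, pending_pods=0))   # False: load absorbed, no alert
print(actionable(95, pending_pods=4))   # True: saturation, fire the alert
```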

When there are periods of traffic spikes – Black Friday for e-commerce sites, lunchtime for a food delivery site – setting static thresholds can lead to an increase in false alarms. Applications with such variable traffic fluctuations might not benefit from alerting on static thresholds alone. 

Dynamic resource allocation and variable usage patterns can make using static thresholds in isolation tricky, but thankfully, alternative approaches can help.

Alternatives to static thresholds

For the situations mentioned above, along with those where we might not know all of our application’s failure states (unknown-unknowns), alternative detector thresholding or a hybrid approach can work best. Alternatives include but definitely aren’t limited to: 

  • Ratio-based thresholds – instead of alerting on 500 MB of memory used, use a threshold of 80% of total memory for a specific duration of time
  • Combining static thresholds with additional context – high CPU + pod queuing or error rate or latency 
  • Sudden change detection – alerts on sudden changes to specified conditions like number of logins, response time, etc. (see the sketch after this list) 
  • Historical anomaly detection to baseline environments and alert on deviations from trends – alerting on latency that deviates from historical trends
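
To make one of these concrete, here’s a minimal sketch of the sudden change idea (plain Python over hypothetical response-time samples; in Splunk Observability Cloud, Sudden Change is available as a built-in alert condition, so you wouldn’t hand-roll this yourself):

```python
from statistics import mean, pstdev


def sudden_change(series, current_window=5, baseline_window=30, n_stddevs=3):
    """Flag a sudden change when the mean of the most recent window deviates
    from the preceding baseline window by more than n standard deviations."""
    baseline = series[-(baseline_window + current_window):-current_window]
    current = series[-current_window:]
    spread = pstdev(baseline) or 1e-9    # guard against a perfectly flat baseline
    return abs(mean(current) - mean(baseline)) / spread > n_stddevs


# Hypothetical response-time samples (ms): steady traffic, then a sudden jump.
latency = [120, 118, 125, 122, 119] * 7 + [310, 320, 305, 315, 330]
print(sudden_change(latency))   # True: the recent window deviates sharply
```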

Combining out-of-the-box approaches with custom thresholds can help you build a resilient monitoring solution to keep your applications running smoothly and your users happy. 

Wrap up 

For predictable, critical metrics, static thresholds are a great alerting option. For situations where static thresholds aren’t appropriate, there are thankfully additional solutions that can help. To start, check out Splunk Observability Cloud’s out-of-the-box alert conditions and explore what might work best for your unique application environment. Don’t yet have Splunk Observability Cloud? Try it free for 14 days. 
