Splunk Search

How do I alert if CPU is greater than 97% for more than 15 minutes?

matthew_foos
Path Finder

Splunkers,

Looking for some kind of time modifier that will allow the following alert to fire only if CPU has been at 97% or higher for more than 15 minutes.

Here is the search string I've started working with:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| stats max(cpu_load_percent) as load by host
| eval load = round(load, 2)
| where load >=97
| rename host as Host, load as "% Processor Time"

Any advice would be great.

Thanks.

1 Solution

Raschko
Communicator

You can use the streamstats command with time_window instead of stats.

Try this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min avg(cpu_load_percent) as last15min_load count by host
| eval last15min_load = if(count < 18, null(), round(last15min_load, 2))
| where last15min_load >= 97
| table host, _time, cpu_load_percent, last15min_load, count

The streamstats command looks back over the events of the last 15 minutes (per host) and calculates the average load.
It also yields the count of events in that window for use in the next eval command.

The eval line checks that there are at least 18 events in the window, to make sure enough events have been logged for the average calculation.
Otherwise you would get an alert at every reboot, because a single event with high load would be enough to trigger it.
I took 18 because that's the event count I get from one host within 3 minutes (1 event every 10 seconds).
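If the requirement is that every sample in the window stays at 97% or above (rather than the 15-minute average being >= 97), a minimal sketch of the same approach, just swapping avg() for min() and otherwise making the same field-name and sampling-rate assumptions, could look like this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min min(cpu_load_percent) as last15min_min count by host
| eval last15min_min = if(count < 18, null(), last15min_min)
| where last15min_min >= 97
| table host, _time, cpu_load_percent, last15min_min, count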

HTH
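To actually fire an alert from a search like this, one option is to schedule it and trigger when it returns any results. Below is a minimal savedsearches.conf sketch; the stanza name, cron schedule, dispatch window, and email recipient are placeholder assumptions, and the same settings can be made in the UI via Save As > Alert:

[CPU over 97 percent for 15 minutes]
# Placeholder: put the streamstats search from the answer above here, on one line
# search = index=perfmon sourcetype="Perfmon:CPU" ... | streamstats ... | where last15min_load >= 97
enableSched = 1
# Run every 5 minutes over a window slightly larger than 15 minutes
cron_schedule = */5 * * * *
dispatch.earliest_time = -20m
dispatch.latest_time = now
# Trigger when the search returns one or more rows
counttype = number of events
relation = greater than
quantity = 0
alert.track = 1
# Assumed example action: email notification
action.email = 1
action.email.to = ops@example.com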
