Splunk Search

How do I alert if CPU is greater than 97% for more than 15 minutes?

matthew_foos
Path Finder

Splunkers,

Looking for some kind of time modifier that will allow the following alert to fire only if CPU has been at 97% or higher for more than 15 minutes.

Here is the search string I've started working with:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| stats max(cpu_load_percent) as load by host
| eval load = round(load, 2)
| where load >=97
| rename host as Host, load as "% Processor Time"

Any advice would be great.

Thanks.


Raschko
Communicator

You can use the streamstats command with time_window instead of stats.

Try this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min avg(cpu_load_percent) as last15min_load count by host
| eval last15min_load = if(count < 18, null(), round(last15min_load, 2))
| where last15min_load >= 97
| table host, _time, cpu_load_percent, last15min_load, count

The streamstats command looks at each host's events from the preceding 15 minutes and calculates the average load over that window.
It also returns the number of events in the window, which the following eval command uses.

The eval line checks that the window contains at least 18 events, so that the average is based on enough samples. With fewer events, last15min_load is set to null and the where clause drops the row.
Otherwise you would get an alert at every reboot, when the window holds only a single event with high load.
I chose 18 because that is the event count I get from one host within 3 minutes (1 event every 10 seconds).
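
One caveat: a 15-minute average can reach 97 even if a few samples dipped below it. If the alert should fire only when every sample in the window was at or above 97, a variant using min() instead of avg() might look like the sketch below. The field name min15min_load is made up for illustration, and the threshold of 85 events is an assumption: at 1 event every 10 seconds a full 15-minute window holds about 90 events, and 85 leaves some slack for missed samples. Adjust it to your own collection interval.

```
index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min min(cpu_load_percent) as min15min_load count by host
| where count >= 85 AND min15min_load >= 97
| table host, _time, cpu_load_percent, min15min_load, count
```

Raising the count threshold close to a full window also means a host has to report for nearly 15 minutes before it can trigger, which is closer to "at 97% or higher for more than 15 minutes" than the 3-minute minimum above.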

HTH
