Splunk Search

How do I alert if CPU is greater than 97% for more than 15 minutes?

matthew_foos
Path Finder

Splunkers,

Looking for some kind of time modifier that will allow the following alert to fire only if CPU has been at 97% or higher for more than 15 minutes.

Here is the search string I've started working with:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| stats max(cpu_load_percent) as load by host
| eval load = round(load, 2)
| where load >=97
| rename host as Host, load as "% Processor Time"
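As written, this search reduces each host to its single peak load, so one high sample is enough to pass the `>= 97` filter regardless of duration. A quick Python sketch (hypothetical 10-second samples, not real perfmon data) illustrates why a peak-only check can't express "for more than 15 minutes":

```python
# Hypothetical CPU samples taken every 10 s: one momentary spike to 99%,
# otherwise idling around 50%.
samples = [50.0] * 50 + [99.0] + [50.0] * 50

# stats max(...) followed by "where load >= 97" amounts to this check,
# so a single 10-second spike fires the alert:
print(max(samples) >= 97)  # True, even though CPU was high for only 10 s
```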

Any advice would be great.

Thanks.

1 Solution

Raschko
Communicator

You can use the streamstats command with time_window instead of stats.

Try this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min avg(cpu_load_percent) as last15min_load count by host
| eval last15min_load = if(count < 18,null,round(last15min_load, 2))
| where last15min_load >= 97
| table host, _time, cpu_load_percent, last15min_load, count

The streamstats command averages cpu_load_percent over the trailing 15 minutes of events, per host, and also emits the event count for use in the next eval command.

The eval line nulls out the average unless the count is at least 18, to make sure enough events were logged for a meaningful average. Otherwise you would get an alert at every reboot, when there may be only one high-load event in the window. I took 18 because that's the event count I get from one host within 3 minutes (1 event / 10 sec).
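The sliding-window logic streamstats performs here can be simulated in a few lines of Python; this is a minimal sketch with hypothetical sample data (one event every 10 s), showing both the trailing-window average and the minimum-count guard:

```python
from collections import deque

def sliding_window_alerts(events, window_s=15 * 60, min_count=18, threshold=97.0):
    """Simulate streamstats time_window=15min: for each (time, load) event,
    average the loads in the trailing window; alert only when the window
    holds at least min_count events and the average >= threshold."""
    window = deque()          # (timestamp, load) pairs currently in the window
    alerts = []
    for t, load in events:    # events must be sorted by time (cf. | sort 0 _time)
        window.append((t, load))
        while window and window[0][0] < t - window_s:
            window.popleft()  # expire events older than the window
        if len(window) >= min_count:
            avg = sum(l for _, l in window) / len(window)
            if avg >= threshold:
                alerts.append((t, round(avg, 2)))
    return alerts

# One event every 10 s: a single 100% spike never alerts,
# while 20 sustained minutes at 98% does.
spike = [(i * 10, 100.0 if i == 5 else 50.0) for i in range(120)]
sustained = [(i * 10, 98.0) for i in range(120)]
print(sliding_window_alerts(spike))                # []
print(len(sliding_window_alerts(sustained)) > 0)   # True
```

Note how the `min_count` guard suppresses the spike case just like the `if(count < 18, null, ...)` eval: a lone high reading right after a reboot never has enough neighbors in the window to produce an alert.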

HTH

