Splunk Search

How do I alert if CPU is greater than 97% for more than 15 minutes?

matthew_foos
Path Finder

Splunkers,

Looking for some kind of time modifier that will allow the following alert to fire only if CPU has been at 97% or higher for more than 15 minutes.

Here is the search string I've started working with:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| stats max(cpu_load_percent) as load by host
| eval load = round(load, 2)
| where load >=97
| rename host as Host, load as "% Processor Time"

Any advice would be great.

Thanks.

1 Solution

Raschko
Communicator

You can use the streamstats command with time_window instead of stats.

Try this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min avg(cpu_load_percent) as last15min_load count by host
| eval last15min_load = if(count < 18, null(), round(last15min_load, 2))
| where last15min_load >= 97
| table host, _time, cpu_load_percent, last15min_load, count

The streamstats command looks back over the events of the last 15 minutes (per host) and calculates the average load.
It also yields the count of events in that window for use in the next eval command.

The eval line checks that there are at least 18 events in the window, to make sure enough events have been logged for the average calculation.
Otherwise you would get an alert at every reboot, because a single event with high load would be enough to trigger it.
I took 18 because that's the event count I get from one host within 3 minutes (1 event every 10 seconds).
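If the requirement is that every sample in the window stays at 97% or above (rather than the 15-minute average being >= 97), a minimal sketch of the same approach, just swapping avg() for min() and otherwise making the same field-name and sampling-rate assumptions, could look like this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min min(cpu_load_percent) as last15min_min count by host
| eval last15min_min = if(count < 18, null(), last15min_min)
| where last15min_min >= 97
| table host, _time, cpu_load_percent, last15min_min, count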

HTH
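To actually fire an alert from a search like this, one option is to schedule it and trigger when it returns any results. Below is a minimal savedsearches.conf sketch; the stanza name, cron schedule, dispatch window, and email recipient are placeholder assumptions, and the same settings can be made in the UI via Save As > Alert:

[CPU over 97 percent for 15 minutes]
# Placeholder: put the streamstats search from the answer above here, on one line
# search = index=perfmon sourcetype="Perfmon:CPU" ... | streamstats ... | where last15min_load >= 97
enableSched = 1
# Run every 5 minutes over a window slightly larger than 15 minutes
cron_schedule = */5 * * * *
dispatch.earliest_time = -20m
dispatch.latest_time = now
# Trigger when the search returns one or more rows
counttype = number of events
relation = greater than
quantity = 0
alert.track = 1
# Assumed example action: email notification
action.email = 1
action.email.to = ops@example.com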
