Create an alert based on CPU being at 95% for a span of 10 minutes

JPaule
Explorer

I'm having trouble creating an alert that looks at, let's say, 100 different hosts. I need an alert if one or more hosts are at 95% CPU usage for a span of 10 minutes.

Here is what I have so far, but it doesn't work: for example, if 3 hosts are each over 95% for, let's say, 4 minutes, I get an alert. However, I only need the alert if a single server is over 95% for 10 minutes:

index=xxx counter="% Processor Time" sourcetype="perfmon:processor" Value > 95 | timechart span=10m eval(round(avg(Value),1)) by host useother=f

I found a couple of similar posts on this question, but none of the proposed solutions worked.


DalJeanis
Legend

@Jpaule - has your need been met, or do you still need help with this one? If the response worked for you, please accept it so that the question will show as answered. Thanks!

DalJeanis
Legend

Okay, here's the way you need to look at this. You CAN'T kill the records that are under 95%, because those are what tell you that you DON'T want to alert.

your search that brings back all records for each host with _time, host, CPUpct
| bin _time span=1m
| stats max(CPUpct) as CPUpct by host _time
| eval overload=if(CPUpct>=95,1,0)
| rename COMMENT as "the above gets the worst CPU stat in the period and calculates whether that period is over the threshold"

| rename COMMENT as "check whether the overload value has changed, mark the group number and count how many events in the group."
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0) 
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno

| rename COMMENT as "let through only groups that show CPU overload and are 10 or more minutes long" 
| where overload=1 AND groupsize >= 10
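
To see the grouping logic above in action without real data, here is a minimal sketch that generates synthetic events with makeresults. The host name "demo-host", the field "minute", and the CPU values 97 and 40 are made up for illustration; only the pipeline from "eval overload" onward comes from the search above.

```
| makeresults count=15
| streamstats count as minute
| eval _time=relative_time(now(), "@m") - (15 - minute) * 60
| eval host="demo-host", CPUpct=if(minute>=3 AND minute<=13, 97, 40)
| rename COMMENT as "synthetic data: minutes 3 through 13 are overloaded, the rest are not"
| eval overload=if(CPUpct>=95,1,0)
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0)
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno
| where overload=1 AND groupsize >= 10
```

The 15 rows fall into three groups per host: two quiet minutes, eleven overloaded minutes, then two quiet minutes. Only the eleven overloaded rows should survive the final where clause. If you shrink the overloaded range to fewer than 10 minutes (for example minute<=11), nothing should come back, which is the behavior the alert needs.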

Dang. Couldn't find the one I wrote a couple of months back, but here are three, depending on how you want your results.

https://answers.splunk.com/answers/374869/how-to-create-and-trigger-an-alert-if-the-cpu-usag.html

https://answers.splunk.com/answers/102865/how-to-alert-user-when-the-processor-time-exceeds-a-certai...

https://answers.splunk.com/answers/462460/how-to-create-an-alert-to-trigger-when-a-host-exce.html

rahulkumarfgf
Explorer

Hi @DalJeanis !

I have a similar issue: I want to trigger an alert if CPU usage is at 100% for more than 10 minutes. I am using % Processor Time instead of CPUpct, and wanted to know if that will provide the same result. Here is my modified SPL:

index="perfmoncpu" source="PerfmonMk:CPU"
| bin _time span=1m
| stats avg(%_Processor_Time) as PercentProcessorTime by host _time
| eval PercentProcessorTime = round(PercentProcessorTime, 2)
| eval overload = if(PercentProcessorTime >= 100, 1, 0)
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0)
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno
| where overload=1 AND groupsize >= 10
| table overload, host, PercentProcessorTime

Thank you for your help!
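
One difference between this version and the earlier answer is that it aggregates each minute with avg() where the earlier search used max(). The sketch below uses synthetic data to show how the two can disagree within a single one-minute bin. The host name "demo-host", the fields "sample" and "cpu", and the four sample values are assumptions made up for illustration, not real perfmon output.

```
| makeresults count=4
| streamstats count as sample
| eval host="demo-host", _time=relative_time(now(), "@m") + sample * 10
| eval cpu=if(sample=4, 90, 100)
| rename COMMENT as "four samples in the same minute: 100, 100, 100, 90"
| bin _time span=1m
| stats max(cpu) as max_cpu avg(cpu) as avg_cpu by host _time
| eval overload_max=if(max_cpu>=100,1,0), overload_avg=if(avg_cpu>=100,1,0)
```

This should return one row with max_cpu=100 and avg_cpu=97.5, so overload_max is 1 while overload_avg is 0. In other words, with avg() and a threshold of 100, a minute only counts as overloaded when essentially every sample in it is at 100, whereas max() flags the minute if any single sample reaches the threshold. Which one is right depends on what you want the alert to mean.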
