Splunk Search

Create an alert based on CPU being at 95% for a span of 10 minutes

JPaule
Explorer

I'm having issues creating an alert that looks at, let's say, 100 different hosts. I need to get an alert if one or more hosts are at 95% CPU usage for a span of 10 minutes.

Here is what I have so far, but it doesn't work: for example, if 3 hosts are over 95% for, let's say, 4 minutes each, I get an alert. I only need the alert if a single server stays over 95% for 10 minutes:

index=xxx counter="% Processor Time" sourcetype="perfmon:processor" Value > 95 | timechart span=10m eval(round(avg(Value),1)) by host useother=f

I found a couple of similar posts on this question, but none of the proposed solutions worked.

DalJeanis
SplunkTrust

@Jpaule - has your need been met, or do you still need help with this one? If the response worked for you, please accept it so that the question will show as answered. Thanks!

DalJeanis
SplunkTrust

Okay, here's the way you need to look at this. You CAN'T kill the records that are under 95% (the Value > 95 filter in your search does exactly that), because those are what tell you that you DON'T want to alert.

your search that brings back all records for each host with _time, host, CPUpct
| bin _time span=1m
| stats max(CPUpct) as CPUpct by host _time
| eval overload=if(CPUpct>=95,1,0)
| rename COMMENT as "the above gets the worst CPU stat in the period and calculates whether that period is over the threshold"

| rename COMMENT as "check whether the overload value has changed, mark the group number and count how many events in the group."
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0) 
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno

| rename COMMENT as "let through only groups that show CPU overload and are 10 or more minutes long" 
| where overload=1 AND groupsize >= 10
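
For anyone who wants to sanity-check the grouping logic outside Splunk, here is a minimal Python sketch of the same idea. It is not SPL and not a drop-in replacement for the search above; the function name, the sample hosts, and the CPU values are all made up for illustration. It assumes you already have one max-CPU value per host per minute, in time order, which is roughly what the bin/stats step produces.

```python
from itertools import groupby

THRESHOLD = 95.0   # percent CPU, same threshold as the search above
MIN_MINUTES = 10   # minimum run length, counted in one-minute buckets


def sustained_overloads(samples, threshold=THRESHOLD, min_minutes=MIN_MINUTES):
    """Find hosts with an unbroken run of overloaded minutes.

    samples: dict mapping host -> list of per-minute max CPU values, in time order.
    Returns a dict mapping host -> list of (start_index, run_length) tuples,
    one tuple per run that is overloaded and at least min_minutes long.
    """
    alerts = {}
    for host, values in samples.items():
        # Same as: eval overload=if(CPUpct>=95,1,0)
        flags = [1 if v >= threshold else 0 for v in values]
        position = 0
        # groupby splits the flags into runs of identical consecutive values,
        # which is what the streamstats/groupno steps do per host.
        for flag, run in groupby(flags):
            size = len(list(run))  # same role as groupsize
            if flag == 1 and size >= min_minutes:
                alerts.setdefault(host, []).append((position, size))
            position += size
    return alerts


# Hypothetical per-minute data for three hosts (12 minutes each or so).
samples = {
    "hostA": [97] * 4 + [50] * 8,         # only 4 minutes over the threshold
    "hostB": [20] * 2 + [99] * 10,        # 10 consecutive minutes over
    "hostC": [96] * 6 + [40] + [96] * 6,  # 12 minutes over, but interrupted
}

print(sustained_overloads(samples))  # {'hostB': [(2, 10)]}
```

Note that hostC has more overloaded minutes in total than hostB, but the single quiet minute splits them into two runs of 6, so it does not qualify. That quiet minute is exactly the kind of record that gets lost if you filter on the threshold before grouping.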

Dang. Couldn't find the one I wrote a couple of months back, but here are three, depending on how you want your results.

https://answers.splunk.com/answers/374869/how-to-create-and-trigger-an-alert-if-the-cpu-usag.html

https://answers.splunk.com/answers/102865/how-to-alert-user-when-the-processor-time-exceeds-a-certai...

https://answers.splunk.com/answers/462460/how-to-create-an-alert-to-trigger-when-a-host-exce.html

rahulkumarfgf
Explorer

Hi @DalJeanis !

I have a similar issue where I want to trigger an alert if CPU usage is 100% for more than 10 minutes. I am using % Processor Time instead of CPUpct. I wanted to know if that will provide the same result. Here is my modified SPL:

index="perfmoncpu" source="PerfmonMk:CPU"
| bin _time span=1m
| stats avg(%_Processor_Time) as PercentProcessorTime by host _time
| eval PercentProcessorTime = round(PercentProcessorTime, 2)
| eval overload = if(PercentProcessorTime >= 100, 1, 0)
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0)
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno
| where overload=1 AND groupsize >= 10
| table overload, host, PercentProcessorTime

Thank you for your help!
