I'm having trouble creating an alert that looks at, let's say, 100 different hosts. I need to get an alert if one or more hosts are at 95% CPU usage for a span of 10 min.
Here is what I have so far, but it doesn't seem to work: for example, if 3 hosts are each over 95% for, let's say, 4 min, I get an alert, although I only need the alert if a single server is over 95% for 10 min:
index=xxx counter="% Processor Time" sourcetype="perfmon:processor" Value > 95 | timechart span=10m eval(round(avg(Value),1)) by host useother=f
I found a couple similar posts on this question, but none of the proposed solutions worked.
@Jpaule - has your need been met, or do you still need help with this one? If the response worked for you, please accept it so that the question will show as answered. Thanks!
Okay, here's the way you need to look at this. You CAN'T kill the records that are under 95%, because those are what tell you that you DON'T want to alert.
your search that brings back all records for each host with _time, host, CPUpct
| bin _time span=1m
| stats max(CPUpct) as CPUpct by host _time
| eval overload=if(CPUpct>=95,1,0)
| rename COMMENT as "the above gets the worst CPU stat in the period and calculates whether that period is over the threshold"
| rename COMMENT as "check whether the overload value has changed, mark the group number and count how many events in the group."
| streamstats current=f last(overload) as prevload by host
| eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0)
| streamstats sum(newgroup) as groupno by host
| eventstats count as groupsize by host groupno
| rename COMMENT as "let through only groups that show CPU overload and are 10 or more minutes long"
| where overload=1 AND groupsize >= 10
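For anyone who wants to reason about the grouping logic outside of Splunk, here is a rough Python sketch of the same idea: flag each minute as overloaded or not, split the flags into consecutive runs, and keep only overloaded runs that are at least 10 minutes long. It is an illustration under assumptions, not a translation of the search: the function name `overloaded_hosts`, the `sample` data, and the one-value-per-minute input shape are all made up for this example.

```python
from itertools import groupby

THRESHOLD = 95     # percent CPU, same threshold as the search above
MIN_MINUTES = 10   # minimum length of a consecutive overloaded run

def overloaded_hosts(per_minute_max):
    """Return hosts with at least one run of MIN_MINUTES consecutive
    minutes at or above THRESHOLD.

    per_minute_max is a hypothetical input shape: host -> list of
    per-minute max CPU values in time order, one value per minute.
    """
    alerts = []
    for host, values in per_minute_max.items():
        flags = [v >= THRESHOLD for v in values]
        # groupby splits the flags into consecutive runs of equal value,
        # which plays the role of groupno/groupsize in the search.
        for overload, run in groupby(flags):
            if overload and len(list(run)) >= MIN_MINUTES:
                alerts.append(host)
                break
    return alerts

# Made-up sample data for illustration only.
sample = {
    "hostA": [97] * 4 + [40] * 8,          # 4-minute spike
    "hostB": [50] * 2 + [99] * 10,         # 10 consecutive minutes over
    "hostC": [96] * 6 + [80] + [96] * 6,   # 12 minutes over, but interrupted
}

print(overloaded_hosts(sample))    # ['hostB']

# Dropping the below-threshold minutes first hides the interruption on hostC.
filtered = {h: [v for v in vals if v >= THRESHOLD] for h, vals in sample.items()}
print(overloaded_hosts(filtered))  # ['hostB', 'hostC']
```

The second call shows why the records under the threshold have to stay in: once they are filtered out, hostC's two separate 6-minute bursts look like one 12-minute run.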
Dang. Couldn't find the one I wrote a couple of months back, but here are three, depending on how you want your results.
Hi @DalJeanis!
I have a similar issue: I want to trigger an alert if CPU usage is 100% for more than 10 min. I am using % Processor Time instead of CPUpct and wanted to know whether that will provide the same result. Here is my modified SPL:
index="perfmoncpu" source="PerfmonMk:CPU" | bin _time span=1m
| stats avg(%_Processor_Time) as PercentProcessorTime by host _time
|eval PercentProcessorTime = round(PercentProcessorTime, 2)
|eval overload = if(PercentProcessorTime >= 100, 1, 0)
|streamstats current=f last(overload) as prevload by host
|eval newgroup=case(isnull(prevload),1, prevload!=overload,1, true(),0)
|streamstats sum(newgroup) as groupno by host
|eventstats count as groupsize by host groupno
|where overload=1 AND groupsize >= 10
|table overload, host, PercentProcessorTime
Thank you for your help!
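One difference worth checking in the modified search is that it aggregates each minute with avg, while the earlier search uses max. With a threshold of 100, the two can disagree whenever a minute contains more than one sample. A small Python sketch of that arithmetic, using made-up readings and a hypothetical helper name `minute_flags`:

```python
from statistics import mean

LIMIT = 100  # percent, the threshold used in the modified search

def minute_flags(samples):
    """Return (flag_using_max, flag_using_avg) for one 1-minute bin.

    samples is a hypothetical list of CPU readings that fall in that bin.
    """
    return max(samples) >= LIMIT, mean(samples) >= LIMIT

# Made-up readings: two at 100% and one dip to 90% within the same minute.
print(minute_flags([100.0, 90.0, 100.0]))   # (True, False)

# With every reading at 100%, both aggregations agree.
print(minute_flags([100.0, 100.0, 100.0]))  # (True, True)
```

So with avg, a single sample below 100 in a minute is enough to mark that minute as not overloaded and break the run, which makes the alert stricter than the max version.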