Alerting

Need a Splunk alert that fires when CPU %, memory use %, or disk use % exceeds 75% (while also indicating the top offending processes)

Path Finder

Hello Splunkers. Noob here. I have an alert that fires when any of the three metrics (listed in the title) goes above 75%. I just need to add to the alert which top offending processes are causing the overages. Here is my query so far, which does work to show when any of the three main metrics goes above 75%.

index=blah (sourcetype="PerfDisk" OR sourcetype="PerfCPU" OR sourcetype="PerfMem" OR sourcetype="PerfProcess") (host=blah OR host=blah OR host=blah OR host=blah ) earliest=-5m
| stats avg(%CommittedBytes) as memuseprcnt
avg(cpuLoadPerc) as cpuloadprcnt
avg(%DiskTime) as diskutilizationprcnt
by host
| eval fireitup = case(cpuloadprcnt > 75,1,
memuseprcnt > 75,1,
diskutilizationprcnt > 75,1,
true() ,0)
| where fireitup > 0
| table host memuseprcnt cpuloadprcnt diskutilizationprcnt

Any ideas on getting the top offending processes causing the overages? Any help is much appreciated.

0 Karma
1 Solution

Path Finder

Here is what I ended up doing:

index=win sourcetype="Perf:logDisk" instance!=Total (host=myhost) earliest=-5m
| eval volume = instance
| stats avg(%DiskTime) as diskUse% by volume host
| join type=left host
[| search index=win sourcetype="Perf:Process" category=%ProcessorTime=* NOT(instance IN(Total, Idle)) (host=myhost) earliest=-5m
| stats avg(%ProcessorTime) as %ProcessorTime by host instance
| sort -%ProcessorTime
| streamstats count by host
| where count=1
| eval %ProcessorTime=round('%ProcessorTime')
| eval AdditionalInfoCPU = "Top Resource Task=" . instance . ", Task Time=" . '%ProcessorTime'
| fields host AdditionalInfoCPU ]

I then repeated that for a bunch of other metrics (mem %, CPU %, etc.) in separate subsearches.
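For anyone wondering what one of those repeated subsearches might look like, here is a rough sketch for memory, modeled on the CPU subsearch above. It is not the exact query I ran: the field name Working_Set and the output names workingSet, workingSetMB, and AdditionalInfoMem are assumptions for illustration, so swap in whatever memory counter field your Perf:Process events actually carry.

```spl
| join type=left host
    [| search index=win sourcetype="Perf:Process" NOT(instance IN(Total, Idle)) (host=myhost) earliest=-5m
    | stats avg(Working_Set) as workingSet by host instance
    | sort 0 -workingSet
    | streamstats count by host
    | where count=1
    | eval workingSetMB = round(workingSet / 1024 / 1024)
    | eval AdditionalInfoMem = "Top Memory Task=" . instance . ", Working Set MB=" . workingSetMB
    | fields host AdditionalInfoMem ]
```

The streamstats/where pair keeps only the first row per host after the descending sort, which is the top process for that host, the same trick as in the CPU subsearch.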

Then

| eval "memuse%" = round('memuse%', 2)
| eval "cpuload%" = round('cpuload%', 2)
| eval "diskutilization%" = round('diskutilization%', 2)
| eval "IndividualDiskUse%" = round('IndividualDiskUse%', 2)
| eval firealert = case('cpuload%' > 75,1,
'memuse%' > 75,1,
'IndividualDiskUse%' > 75,1,
true() ,0)
| where firealert>0
| stats values(volume) values(DiskUse_%) by everything you want

| Table it all out
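To make those last two placeholder lines a bit more concrete, here is one way they could be filled in. Treat it as a sketch under assumptions: the short names diskPct, cpuPct, memPct, volumes, and maxDiskPct are made up for readability, and the field list is only a guess at "everything you want". The rename step is there so the later commands do not have to deal with % in field names, and fillnull keeps hosts whose left join found no top process from being dropped by the by clause.

```spl
| rename "IndividualDiskUse%" as diskPct, "cpuload%" as cpuPct, "memuse%" as memPct
| fillnull value="n/a" AdditionalInfoCPU
| stats values(volume) as volumes max(diskPct) as maxDiskPct by host cpuPct memPct AdditionalInfoCPU
| table host cpuPct memPct volumes maxDiskPct AdditionalInfoCPU
```

Note that stats drops any row where a by field is null, so if cpuPct or memPct can be missing for a host, they would need the same fillnull treatment.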


0 Karma

Champion

Hi @spluzer,
"I just need to add into the alert what top offending processes are causing the overages": well, then you need to capture the process names under CPU, memory, or disk. I am sure they are mentioned in your events somewhere?
You can't just go by sourcetype; all that would tell you is that when CPU spikes above 75%, the data came from the PerfCPU sourcetype.
Perhaps you have more granular details than that, such as the process names recorded under those sourcetypes?
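One quick way to check is to list what the process-level events actually contain. A rough sketch, using the index and sourcetype from your question and assuming the process name lands in a field called instance (that field name is a guess and may be different in your data):

```spl
index=blah sourcetype="PerfProcess" earliest=-5m
| stats count by host instance
| sort 0 -count
```

If that comes back empty, running the same search with fieldsummary in place of the stats line should show which fields are populated for that sourcetype.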

0 Karma

SplunkTrust

@spluzer If your problem is resolved, please accept the answer to help future readers.
