Alerting

Need a Splunk alert that fires when cpu%, mem use%, OR disk use% > 75% (while also indicating the top offending processes)

spluzer
Communicator

Hello Splunkers. Noob here. I have an alert that fires when any of the three metrics (listed in the title) goes above 75%. I just need to add to the alert the top offending processes that are causing the overages. Here is my query so far, which does work to show when any of the three main metrics goes above 75%.

index=blah (sourcetype="PerfDisk" OR sourcetype="PerfCPU" OR sourcetype="PerfMem" OR sourcetype="PerfProcess") (host=blah OR host=blah OR host=blah OR host=blah) earliest=-5m
| stats avg(%CommittedBytes) as mem_use_prcnt
        avg(cpuLoadPerc) as cpu_load_prcnt
        avg(%DiskTime) as disk_utilization_prcnt
        by host
| eval fire_it_up = case(cpu_load_prcnt > 75, 1,
                         mem_use_prcnt > 75, 1,
                         disk_utilization_prcnt > 75, 1,
                         true(), 0)
| where fire_it_up > 0
| table host mem_use_prcnt cpu_load_prcnt disk_utilization_prcnt

Any ideas on getting the top offending processes causing the overages? Any help is much appreciated.

1 Solution

spluzer
Communicator

Here is what I ended up doing:

index=win sourcetype="Perf:logDisk" instance!=_Total host=myhost earliest=-5m
| eval volume = instance
| stats avg(%_Disk_Time) as diskUse% by volume host
| join type=left host
    [| search index=win sourcetype="Perf:Process" %_Processor_Time=* NOT (instance IN (_Total, Idle)) host=myhost earliest=-5m
    | stats avg(%_Processor_Time) as %_Processor_Time by host instance
    | sort - %_Processor_Time
    | streamstats count by host
    | where count=1
    | eval %_Processor_Time = round('%_Processor_Time')
    | eval Additional_InfoCPU = "Top Resource Task=" . instance . ", Task Time=" . '%_Processor_Time'
    | fields host Additional_InfoCPU ]

I then repeated that pattern for the other metrics (mem%, cpu%, etc.) in separate subsearches.
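For example, a memory leg joined the same way might look like this (a sketch only — the %_Working_Set counter name, and whether your Perf:Process data carries a per-process memory counter at all, are assumptions you would need to verify against your own events):

| join type=left host
    [| search index=win sourcetype="Perf:Process" %_Working_Set=* NOT (instance IN (_Total, Idle)) host=myhost earliest=-5m
    | stats avg(%_Working_Set) as %_Working_Set by host instance
    | sort - %_Working_Set
    | streamstats count by host
    | where count=1
    | eval Additional_InfoMem = "Top Memory Task=" . instance . ", Working Set=" . round('%_Working_Set')
    | fields host Additional_InfoMem ]

The sort/streamstats/where count=1 sequence is the same trick as the CPU leg: rank processes per host, then keep only the top one.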

Then

| eval mem_use_% = round(mem_use_%, 2)
| eval cpu_load_% = round(cpu_load_%, 2)
| eval disk_utilization_% = round(disk_utilization_%, 2)
| eval Individual_DiskUse_% = round(Individual_DiskUse_%, 2)
| eval fire_alert = case(cpu_load_% > 75, 1,
                         mem_use_% > 75, 1,
                         Individual_DiskUse_% > 75, 1,
                         true(), 0)
| where fire_alert > 0
| stats values(volume) values(DiskUse_%) by <whatever fields you want>
| table <your fields>


Sukisen1981
Champion

hi @spluzer
"I just need to add into the alert what top offending processes are causing the overages" — well, then you need to capture the process names under CPU, memory, or disk. I am sure they are mentioned in your events somewhere?
You can't just go by sourcetype; all that would tell you is that when CPU spikes >75%, the events came from the PerfCPU sourcetype.
Perhaps you have more granular detail than that, like which CPU process names appear under those sourcetypes?
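A quick way to check whether per-process detail exists is to rank the process-level events directly (a sketch — the %_Processor_Time and instance field names are assumptions based on typical Windows Perfmon data, not on the poster's actual events):

index=blah sourcetype="PerfProcess" host=myhost earliest=-5m
| stats avg(%_Processor_Time) as proc_cpu by host instance
| sort - proc_cpu
| head 5

If instance holds process names, the top rows are the top offenders to fold into the alert.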



richgalloway
SplunkTrust

@spluzer If your problem is resolved, please accept the answer to help future readers.

---
If this reply helps you, Karma would be appreciated.