Alerting

Create a real-time alert that triggers when count > 2 within 1 minute

damonmanni
Path Finder

Use Case:
• Our Jira instance crashes intermittently when there is heavy load on the svr.
• The cause is The JVM Garbage Collection (GC) does not run effectively as server load increases, eventually crashing java. Also the cpu climbs to 390% usage due to JVM struggling and consuming resources.

Splunk Goal:
• To Monitor in real-time and Alert the Admin when splunk sees the GC beginning to struggle so that admin can do graceful restart of jira b/f it crashes.

KPI:
• Search for “Full GC” in logs, if more than 2 hits are found within 1 minute timespan, then JVM is heading out of control and send email alert to admin.

I need your help on:
From my attempts below, I think I am extracting what I need as a Report, but I don't know how to make the alert trigger only when the internal count (which is 5 in two results below ) > 2 and NOT the total event records found of (5+1=6) or (1+5+1=7) of that day.
It seems that the total # of events returned (6 or 7) will always trip the alert, which is not what we want.

Here is my setup:
Setup my realtime Alert as:
• Type:
alt text

My attempts:
• Query-1 This is to give me a few target dates that have known failures so I can use that data to test with:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

occurred    _time   count

1 GC ok 2018-01-31 18:00:00 1
... etc...
15 GC ok 2018-02-14 18:55:00 1
16 GC ok 2018-02-15 23:00:00 1
17 Possible GC issue occurring 2018-02-19 07:48:00 5
18 GC ok 2018-02-19 08:08:00 1
19 GC ok 2018-02-21 10:12:00 1
20 Possible GC issue occurring 2018-02-21 10:14:00 5
21 GC ok 2018-02-21 10:28:00 1
22 GC ok 2018-02-25 15:00:00 1
23 GC ok 2018-03-01 03:00:00 1

• Query-2 – To simulate a Real-time trigger, I took Query-1 and ran it against a danger date above:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" earliest="02/19/2018:00:00:00" latest="02/19/2018:23:00:00" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

occurred    _time   count

1 Possible GC issue occurring 2018-02-19 07:48:00 5
2 GC ok 2018-02-19 08:08:00 1

What am I missing?
cheers,
Damon

Tags (1)
0 Karma
1 Solution

starcher
Influencer

If making an alert never let non alert conditions create rows. It's fine for a report not an alert. Instead of "GC ok" use null() and put a | where isnnotnull(occurred) after the stats. Then you should get only rows where the conditions are met.

Also never run "real-time" searches. Run over short intervals. like every 5 minutes.

View solution in original post

0 Karma

starcher
Influencer

If making an alert never let non alert conditions create rows. It's fine for a report not an alert. Instead of "GC ok" use null() and put a | where isnnotnull(occurred) after the stats. Then you should get only rows where the conditions are met.

Also never run "real-time" searches. Run over short intervals. like every 5 minutes.

0 Karma

damonmanni
Path Finder

Thank you Starcher! I did everything you mentioned. That did the trick
cheers,
D

0 Karma

DalJeanis
Legend

@damonmanni - we converted starcher's comment to an answer. Please accept the answer so that your question will show as closed.

0 Karma
Get Updates on the Splunk Community!

What's New in Splunk Enterprise 9.4: Features to Power Your Digital Resilience

Hey Splunky People! We are excited to share the latest updates in Splunk Enterprise 9.4. In this release we ...

Take Your Breath Away with Splunk Risk-Based Alerting (RBA)

WATCH NOW!The Splunk Guide to Risk-Based Alerting is here to empower your SOC like never before. Join Haylee ...

SignalFlow: What? Why? How?

What is SignalFlow? Splunk Observability Cloud’s analytics engine, SignalFlow, opens up a world of in-depth ...