Alerting

Create a real-time alert that triggers when count > 2 within 1 minute

damonmanni
Path Finder

Use Case:
• Our Jira instance crashes intermittently when there is heavy load on the svr.
• The cause is The JVM Garbage Collection (GC) does not run effectively as server load increases, eventually crashing java. Also the cpu climbs to 390% usage due to JVM struggling and consuming resources.

Splunk Goal:
• To Monitor in real-time and Alert the Admin when splunk sees the GC beginning to struggle so that admin can do graceful restart of jira b/f it crashes.

KPI:
• Search for “Full GC” in logs, if more than 2 hits are found within 1 minute timespan, then JVM is heading out of control and send email alert to admin.

I need your help on:
From my attempts below, I think I am extracting what I need as a Report, but I don't know how to make the alert trigger only when the internal count (which is 5 in two results below ) > 2 and NOT the total event records found of (5+1=6) or (1+5+1=7) of that day.
It seems that the total # of events returned (6 or 7) will always trip the alert, which is not what we want.

Here is my setup:
Setup my realtime Alert as:
• Type:
alt text

My attempts:
• Query-1 This is to give me a few target dates that have known failures so I can use that data to test with:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

occurred    _time   count

1 GC ok 2018-01-31 18:00:00 1
... etc...
15 GC ok 2018-02-14 18:55:00 1
16 GC ok 2018-02-15 23:00:00 1
17 Possible GC issue occurring 2018-02-19 07:48:00 5
18 GC ok 2018-02-19 08:08:00 1
19 GC ok 2018-02-21 10:12:00 1
20 Possible GC issue occurring 2018-02-21 10:14:00 5
21 GC ok 2018-02-21 10:28:00 1
22 GC ok 2018-02-25 15:00:00 1
23 GC ok 2018-03-01 03:00:00 1

• Query-2 – To simulate a Real-time trigger, I took Query-1 and ran it against a danger date above:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" earliest="02/19/2018:00:00:00" latest="02/19/2018:23:00:00" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

occurred    _time   count

1 Possible GC issue occurring 2018-02-19 07:48:00 5
2 GC ok 2018-02-19 08:08:00 1

What am I missing?
cheers,
Damon

Tags (1)
0 Karma
1 Solution

starcher
SplunkTrust
SplunkTrust

If making an alert never let non alert conditions create rows. It's fine for a report not an alert. Instead of "GC ok" use null() and put a | where isnnotnull(occurred) after the stats. Then you should get only rows where the conditions are met.

Also never run "real-time" searches. Run over short intervals. like every 5 minutes.

View solution in original post

0 Karma

starcher
SplunkTrust
SplunkTrust

If making an alert never let non alert conditions create rows. It's fine for a report not an alert. Instead of "GC ok" use null() and put a | where isnnotnull(occurred) after the stats. Then you should get only rows where the conditions are met.

Also never run "real-time" searches. Run over short intervals. like every 5 minutes.

0 Karma

damonmanni
Path Finder

Thank you Starcher! I did everything you mentioned. That did the trick
cheers,
D

0 Karma

DalJeanis
SplunkTrust
SplunkTrust

@damonmanni - we converted starcher's comment to an answer. Please accept the answer so that your question will show as closed.

0 Karma
Get Updates on the Splunk Community!

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...

March Community Office Hours Security Series Uncovered!

Hello Splunk Community! In March, Splunk Community Office Hours spotlighted our fabulous Splunk Threat ...

Stay Connected: Your Guide to April Tech Talks, Office Hours, and Webinars!

Take a look below to explore our upcoming Community Office Hours, Tech Talks, and Webinars in April. This post ...