Create a real-time alert that triggers when count > 2 within 1 minute

damonmanni
Path Finder

Use Case:
• Our Jira instance crashes intermittently when there is heavy load on the server.
• The cause is that JVM garbage collection (GC) does not run effectively as server load increases, which eventually crashes Java. CPU usage also climbs to 390% as the JVM struggles and consumes resources.

Splunk Goal:
• To monitor in real time and alert the admin when Splunk sees the GC beginning to struggle, so that the admin can do a graceful restart of Jira before it crashes.

KPI:
• Search for “Full GC” in the logs. If more than 2 hits are found within a 1-minute timespan, the JVM is heading out of control, so send an email alert to the admin.

I need your help on:
From my attempts below, I think I am extracting what I need as a report, but I don't know how to make the alert trigger only when the per-minute count (which is 5 in the two results below) is > 2, and NOT on the total number of event records found for that day (5+1=6 or 1+5+1=7).
It seems that the total number of events returned (6 or 7) will always trip the alert, which is not what we want.

Here is my setup:
I configured the alert as a real-time alert.

My attempts:
• Query-1 This is to give me a few target dates that have known failures so I can use that data to test with:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

#    occurred                      _time                 count
1    GC ok                         2018-01-31 18:00:00   1
... etc...
15   GC ok                         2018-02-14 18:55:00   1
16   GC ok                         2018-02-15 23:00:00   1
17   Possible GC issue occurring   2018-02-19 07:48:00   5
18   GC ok                         2018-02-19 08:08:00   1
19   GC ok                         2018-02-21 10:12:00   1
20   Possible GC issue occurring   2018-02-21 10:14:00   5
21   GC ok                         2018-02-21 10:28:00   1
22   GC ok                         2018-02-25 15:00:00   1
23   GC ok                         2018-03-01 03:00:00   1

• Query-2 – To simulate a Real-time trigger, I took Query-1 and ran it against a danger date above:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" earliest="02/19/2018:00:00:00" latest="02/19/2018:23:00:00" | bucket _time span=1m | stats count by _time| eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

#    occurred                      _time                 count
1    Possible GC issue occurring   2018-02-19 07:48:00   5
2    GC ok                         2018-02-19 08:08:00   1

What am I missing?
cheers,
Damon

starcher
Influencer

If you are making an alert, never let non-alert conditions create rows. That is fine for a report, but not for an alert. Instead of "GC ok", use null(), and put a | where isnotnull(occurred) after the eval that follows the stats. Then you should get rows only where the condition is met.

Also, never run "real-time" searches. Run scheduled searches over short intervals instead, like every 5 minutes.
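
For reference, here is a minimal sketch of how the advice above could be applied to Query-1. It is an illustration, not a tested alert: the earliest=-5m@m latest=@m window is an assumption chosen to match a 5-minute schedule, and the source="gc-" filter is copied unchanged from the question. Only minutes with more than 2 "Full GC" hits produce a row, so a quiet window returns no results at all.

```spl
index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" earliest=-5m@m latest=@m
| bucket _time span=1m
| stats count by _time
| eval occurred=if(count>2, "Possible GC issue occurring", null())
| where isnotnull(occurred)
| table occurred, _time, count
```

With a search like this, the alert can presumably be saved as a scheduled alert that runs every 5 minutes (for example with the cron expression */5 * * * *) and triggers when the number of results is greater than 0. Because the "GC ok" rows are gone, the total number of results no longer trips the alert on its own. A shorter variant would be to replace the eval and where steps with a single | where count > 2.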

damonmanni
Path Finder

Thank you Starcher! I did everything you mentioned. That did the trick
cheers,
D

DalJeanis
Legend

@damonmanni - we converted starcher's comment to an answer. Please accept the answer so that your question will show as closed.
