Need help in getting report on exceptions

msing34 · ‎10-26-2017

I have log files in which we are getting different types of error messages:

[10/26/17 17:29:59:635 CDT] 00030f30 SystemErr R com.test.myapp.framework.error.exception.TransactionException: com.test.myapp.framework.error.exception.CommandException: Problem with COLD external system
at com.test.myapp.transactionscripts.transactions.ColdHealthStatementsT.getModel(ColdHealthStatementsT.java:44)
at com.test.myapp.transactionscripts.ViewStatsTS.getModel(ViewStatsTS.java:102)
at com.test.myapp.actions.MemberAction.doRender(MemberAction.java:167)
at com.test.myapp.actions.portlet.PortletAction.doRender(PortletAction.java:40)
at com.test.myapp.framework.action.BaseAction.dispatch(BaseAction.java:455)
at com.test.myapp.actions.SecureMemberAction.dispatch(SecureMemberAction.java:72)
at com.test.myapp.framework.action.BaseAction.execute(BaseAction.java:196)

[10/26/17 17:33:50:916 CDT] 000619ae SystemErr R java.net.SocketTimeoutException: Read timed out

[10/26/17 17:23:08:145 CDT] 0009e9ce SystemErr R com.test.myapp.framework.error.exception.ApplicationException: com.test.mbr.ldap.adapter.LDAPAdapterException: Couldn't update user:
at com.test.myapp.actions.EasyLoginAction.validateUserLDAPdataForTwoKey(EasyLoginAction.java:1402)
at com.test.myapp.actions.EasyLoginAction.doRender(EasyLoginAction.java:764)
at com.test.myapp.framework.action.BaseAction.dispatch(BaseAction.java:455)
at com.test.myapp.framework.action.BaseAction.execute(BaseAction.java:196)
at org.springframework.web.struts.DelegatingActionProxy.execute(DelegatingActionProxy.java:110)
at org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:484)
at org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:274)
at org.apache.struts.action.ActionServlet.process(ActionServlet.java:1482)

I want to do two things:

I want to get a report on the different types of exceptions and their counts.
I want to set up an email alert when a specific threshold is breached.

How can I do that?

Richfez · ‎10-27-2017

Ah, the lovely Java error logs!

I only deal with Java logs a bit, but these look reasonably straightforward. If you don't care about the huge trailing pile of "at com..." lines, you could do something like this as a start.

my search here | rex "^(?<DateTimeString>\[\d{2}\/\d{2}\/\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}\s\w{3}\])\s(?<ExceptionID>\S{8})\s(?<Exception_Level>\S+)\s(?<SomeCharacter>\S+)\s(?<MyError>.*)"

That should return:

DateTimeString = [10/26/17 17:29:59:635 CDT]
ExceptionID = 00030f30
Exception_Level = SystemErr
SomeCharacter = R
MyError = com.test.myapp.framework.error.exception.TransactionException: com.test.myapp.framework.error.exception.CommandException: Problem with COLD external system

There could be line-breaking issues. If the rex grabs ALL the remaining text as MyError, we can fix that; I'd just have to look up the multiline or non-multiline option for rex.
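One way to sanity-check the pattern and the line-breaking question outside Splunk is to run it against the first sample event from the question. The sketch below is only an approximation: it uses Python's `re` module rather than Splunk's regex engine, so the named groups are written `(?P<name>...)` instead of `(?<name>...)`, and the helper name `extract_fields` is made up for illustration. It assumes the event arrives as one multi-line string with single spaces between the header fields, as shown in the post.

```python
import re

# Same pattern as the rex above, rewritten with Python's (?P<name>...) syntax.
PATTERN = re.compile(
    r"^(?P<DateTimeString>\[\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}\s\w{3}\])"
    r"\s(?P<ExceptionID>\S{8})"
    r"\s(?P<Exception_Level>\S+)"
    r"\s(?P<SomeCharacter>\S+)"
    r"\s(?P<MyError>.*)"
)

# First event from the question, with the first "at ..." line attached.
SAMPLE_EVENT = (
    "[10/26/17 17:29:59:635 CDT] 00030f30 SystemErr R "
    "com.test.myapp.framework.error.exception.TransactionException: "
    "com.test.myapp.framework.error.exception.CommandException: "
    "Problem with COLD external system\n"
    "at com.test.myapp.transactionscripts.transactions."
    "ColdHealthStatementsT.getModel(ColdHealthStatementsT.java:44)"
)


def extract_fields(event: str) -> dict:
    """Hypothetical helper: return the named groups, or {} if nothing matches."""
    match = PATTERN.match(event)
    return match.groupdict() if match else {}


fields = extract_fields(SAMPLE_EVENT)
for name, value in fields.items():
    print(f"{name} = {value}")
```

In this check MyError stops at the end of the first line, because `.` does not match a newline unless a "dot matches all" flag is turned on. If Splunk behaves the same way on your events, the trailing "at ..." lines stay out of MyError.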

Pro tip: if that works fine and you'd like the extraction to happen automatically, you COULD edit transforms.conf or props.conf and put it there. Alternatively, start the field extractor, pick the sourcetypes/events you want to extract from, choose a regex-based extraction, click the "I'll write my own" option, and paste in the rex's pattern (without the quotes). If that works, save it! You may have to change the permissions on it to share it globally.

Once you have those fields extracting, there are a variety of ways to get to a report and/or alert.

For a report, use a search like the following. NOTE: I'm assuming you have the rex converted to something that works automatically. If you do NOT have that done, no worries, just include the | rex ... in between the my search here and the | stats ..., right?

my search here 
| stats count by Exception_Level, MyError | sort - count

That would get you a list with a count of each, sorted with the most common first.
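Because MyError contains the full message text, two events with the same exception class but different messages would be counted separately. If you'd rather count by exception type, one option is to keep only the leading class name before counting. The Python sketch below illustrates that grouping on the MyError values from the question; the `exception_class` helper and the class-name pattern are assumptions made for illustration, not something Splunk provides, and the same idea would need a second extraction against MyError in Splunk.

```python
import re
from collections import Counter

# MyError values as extracted from the three sample events in the question.
# The timeout entry appears twice ONLY to make the sort order visible here.
my_errors = [
    "com.test.myapp.framework.error.exception.TransactionException: "
    "com.test.myapp.framework.error.exception.CommandException: "
    "Problem with COLD external system",
    "java.net.SocketTimeoutException: Read timed out",
    "java.net.SocketTimeoutException: Read timed out",
    "com.test.myapp.framework.error.exception.ApplicationException: "
    "com.test.mbr.ldap.adapter.LDAPAdapterException: Couldn't update user:",
]

# Assumed pattern: the leading fully qualified class name, up to the first colon.
CLASS_NAME = re.compile(r"^[\w.$]+")


def exception_class(my_error: str) -> str:
    """Hypothetical helper: return the leading class name, or the input unchanged."""
    match = CLASS_NAME.match(my_error)
    return match.group(0) if match else my_error


# Rough equivalent of "stats count by <class> | sort - count".
counts = Counter(exception_class(e) for e in my_errors)
for name, count in counts.most_common():
    print(count, name)
```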

If you'd prefer a chart of errors on a timeline, you could try replacing the | stats ... (to the end) with

| timechart count by MyError

Or

| timechart count by Exception_Level

So, for an alert, there are a couple of ways to do this too. First, you'll want to review your data like we just did and find a good threshold. An alert that fires ALL THE TIME isn't useful to you because you'll learn to hate it. The alert should only tell you when something's actually broken. That way, when you get annoyed with it, the 'fix' is to fix the conditions that trigger the alert (i.e. fix the problem) instead of just turning off the useless alert. 🙂

So, find a criterion. Let's say it's more than 5 events in 5 minutes for one error. So let's search, stats things up, then search those results, OK?

my search here earliest=-5m
| stats count by Exception_Level, MyError
| search count>5

You could literally save that as an alert right there, alerting whenever it returns results. I'd schedule it to run every 5 minutes. Every 5 minutes it would look back at the past 5 minutes and decide whether there were enough events, and if so, send whatever you have set up for your alert. Please be careful, especially when testing this, and be sure to thoroughly read the alerts manual section on trigger conditions and throttling before leaving for the weekend.
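If it helps to see the trigger logic spelled out, here is a small Python sketch of the same "count per error over the last 5 minutes, keep anything above 5" check. It is an illustration only, not how Splunk evaluates alerts: the function name `errors_over_threshold` is made up, and the event timestamps are synthetic.

```python
from collections import Counter
from datetime import datetime, timedelta


def errors_over_threshold(events, now, window=timedelta(minutes=5), threshold=5):
    """Return {error: count} for errors seen MORE than `threshold` times in the window."""
    recent = Counter(err for ts, err in events if now - window <= ts <= now)
    return {err: n for err, n in recent.items() if n > threshold}


# Synthetic timestamps for illustration; the error texts come from the question.
TIMEOUT = "java.net.SocketTimeoutException: Read timed out"
now = datetime(2017, 10, 26, 17, 35, 0)

# Six timeouts in the last 2.5 minutes: above the threshold of 5.
events = [(now - timedelta(seconds=30 * i), TIMEOUT) for i in range(6)]
# One other error inside the window: below the threshold, so not reported.
events.append((now - timedelta(minutes=1), "Couldn't update user:"))
# One old timeout outside the 5-minute window: ignored.
events.append((now - timedelta(minutes=10), TIMEOUT))

flagged = errors_over_threshold(events, now)
print(flagged)
```

An empty result means nothing crossed the threshold, which corresponds to the saved search returning no rows and the alert staying quiet.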

Hope this helps!

Happy Splunking,
Rich

msing34 · ‎10-30-2017

Great. Thx Rich for helping and providing full details on how to use this. Really helpful!

kunalmao · ‎10-27-2017

Can you tell me which fields are getting extracted from these logs in Splunk?
