How can we show the root cause of any exception from the stack trace (if a stack trace is available)?
Currently we have created a chart showing total exceptions; drilling down shows the different types of exceptions, and drilling down again shows the default log entries for the particular exception that was clicked.
What else can we show through Splunk that is more specific to the cause of the exception than just the log entry (consisting of the full stack trace) for a particular exception?
Any help on this would be appreciated.
One easy way:
What type of log are you trying to find the root cause in when analyzing the stack trace?
There are log types like access_combined and log4j, etc., already defined in Splunk, where basic fields are extracted by default, and there are some handy apps for that purpose.
The key to finding the root cause of an exception is your ability to focus on the exact TIME the exception occurred and look at what else was happening in related systems at that moment.
You have the first step. You've narrowed down a way to isolate the exception, and you can drill down to it, click on it, and send it into the Search interface (one option).
Once you are there, you need something else to look at. As somesoni2 points out, you have to be Splunking other stuff. So let's say this application also utilizes web servers, databases, runs over particular switches, routers, etc. Whatever else is associated with it would need to have its machine data in Splunk. The stack trace is the very bottom of the stack, if you'll pardon the pun.
What you would do, if you had everything in Splunk, for example in one index (this is only an example; one index is not required, and you can easily search across indexes), is something like this:
Start with your chart. Click on the exception you have narrowed down, and drill down to the Search interface.
Now you are in Search and you are looking at the details of the exception. Great. Note what you like, but also note that you are narrowed down to a particular "custom time". Take a look at the histogram (the green bars) and poke around a bit to see the frequency pattern of the errors, to be sure you are on the first occurrence; you could in fact be at the end of a cascade. Once you are satisfied that you have found the first place something bad went down, and the TIME you want to look at, clear out everything in the search box except the index declaration.
Leave the time picker the way it is... and run the search.
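As a sketch of that stripped-down search (assuming, purely for illustration, that the application's events live in an index called `myapp`; the time bounds shown are placeholders for whatever window your time picker already holds):

```
index=myapp earliest="04/01/2024:10:15:00" latest="04/01/2024:10:20:00"
```

In practice you would leave the time picker alone and just type `index=myapp`; the explicit `earliest`/`latest` modifiers are only shown here to make the narrowed window visible in the search string itself.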
What that does is look at what else was going on at the same TIME the exception occurred. Now note the field list on the left. Note the sources and the sourcetypes. Click on Source Types...
You know your application. Before you start exploring, be sure to make a note of the timestamp (the custom time). You can even just save the search; you can delete it later. Once you've got that, look at those sourcetypes. You'll see the number of events occurring for each sourcetype. Do you see one that seems too high? Too low?
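If you prefer a table over the field sidebar, the same sourcetype breakdown can be produced directly (again assuming a hypothetical `myapp` index):

```
index=myapp
| stats count by sourcetype
| sort - count
```

Anything with an unusually high or low count during the narrowed time window is a candidate for a closer look.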
Now you can do something simple if you like: type the words Error OR Fail OR CRITICAL, or whatever you want, to see if you can find errors happening on other systems. Run that search and Splunk will show you where those keywords show up. That can produce "false positives" if you are too simplistic with your keywords, but you know your data, so use words that are more meaningful. Or, if the time is narrowed down too tightly, click the zoom button and back out a bit. Something started before the stack trace began to spew, right? So you are really looking for what happened right before the exception began to report. The question is: what happened, and where?
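That keyword sweep might look like this (the `myapp` index name is a placeholder, and the keywords are just the examples above; substitute terms that are meaningful in your own logs):

```
index=myapp (Error OR Fail OR CRITICAL)
| stats count by sourcetype, source
| sort - count
```

Grouping by sourcetype and source shows which systems were complaining in the minutes before the exception, which is usually where the real root cause lives.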
Keep note of the pattern made by the green bars in the histogram. Often, when something starts to go wrong, many systems will try to recover, come back to the same 'wrong' circumstance, and then take another dive, so you'll see a visible pattern in the histogram.
That's how you do root cause analysis by hand. But again, you need to be Splunking something other than just the stack trace in order to do that.
In Splunk, you can only show information that is present in your logs (you can't add extra information). If your logs have other fields, like an error message or error code, those can be extracted and displayed. If your logs just have a stack trace, you may extract the function/service/application name where the exception occurred and show some statistics based on that. But, as I said earlier, it all depends on the information that you are logging.
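As a sketch of that kind of extraction, assuming a Java-style stack trace in a hypothetical `myapp` index (the index, sourcetype, and regexes are illustrative and would need adjusting to your actual log layout):

```
index=myapp sourcetype=log4j "Exception"
| rex "(?<exception_class>[\w\.]+Exception)"
| rex "\tat (?<top_frame>[\w\.\$]+)\("
| stats count by exception_class, top_frame
| sort - count
```

The first `rex` pulls out the exception class name and the second grabs the topmost stack frame (the method where it was thrown), so the resulting table shows which exceptions are happening and where, rather than just the raw log entries.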