Machine Learning and JMX data

Splunk Employee

I have a Java application giving me some JMX data. The application dies (freezes) at irregular intervals.
From the JMX data I've been able to spot a few metrics that visually seem to indicate that a freeze is about to occur.
My intent is to use these parameters to predict a near future freeze.
So the ones I've spotted are CPUTime and gCDuration as well as the number of gCthreads.
When CPUTime is high and the gCDuration peaks when the gCthreads are more than 300, someone will notice and they will manually reset the JVM - this is the scenario I want to predict in ML.
As I try different approaches in the ML toolkit, none of them seem to do the trick.
How do I know what model is the "right" one?

SplunkTrust

So let's back up and discuss the math behind how this will predict. Remember the old equation y = mx + b?

We will apply that here: y is the field you want to predict (the dependent variable) and x is your predictor field (the independent variable). Let's start with simple linear regression with one independent variable.

First off, forget the predict command and let's do this in the MLTK. You first need to pick an independent variable (the predictor) that has a relationship with your dependent variable (the field you want to predict). You can check this relationship by fitting your model and reading the output: the RMSE and the line-of-best-fit graph. If there's little to no relationship, the dots will not hug the line of best fit. The RMSE measures how far, on average, the predicted values fall from the actual values. Use this feedback to decide whether a field is a good predictor of your dependent variable. You also need a large enough sample for the model to learn from, and I typically like an 80/20 split between training and testing.
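
To make the fit-and-evaluate loop concrete, here is a minimal sketch in plain Python rather than MLTK. It fits y = mx + b on 80% of the data and reports RMSE and R² on the remaining 20%. The data is synthetic and the variable names (`threads`, `gc_duration`) are only stand-ins for the fields mentioned in the question, not real measurements.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Share of the target's variance explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (assumption, not from the thread):
# the target rises roughly linearly with the predictor, plus noise.
rng = random.Random(0)
threads = [rng.uniform(50, 400) for _ in range(200)]
gc_duration = [2.0 * t + 30.0 + rng.gauss(0, 20) for t in threads]

# 80/20 split: fit on the first 80%, evaluate on the held-out 20%.
split = int(0.8 * len(threads))
m, b = fit_line(threads[:split], gc_duration[:split])

predicted = [m * t + b for t in threads[split:]]
test_rmse = rmse(gc_duration[split:], predicted)
test_r2 = r_squared(gc_duration[split:], predicted)

print(f"slope={m:.3f} intercept={b:.3f} RMSE={test_rmse:.2f} R2={test_r2:.3f}")
```

Evaluating on the held-out 20% rather than on the training rows is what tells you whether the relationship generalizes.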

Splunk Employee

Hi
Yes - I have an (old) major in math 🙂
Rusty as it is, I was trying to use that in MLTK just as you suggested. One of the challenges is that there is no single conclusive field that is 100% spot on. Sometimes the CPU will peak above 90% with more than 300 threads, and gCDuration will increase. The next time, the CPU is idling while the thread count approaches 400, and gCDuration is again increasing.
I use jvmUptime to see when they have manually reset the JVM and that value (=result) correlates with one or more of the other ones I mentioned. There are a few others as well but not as clear as these.
We have toyed with 50/50 as well as 80/20 splits and we can get R² quite close to 1, but then the RMSE value skyrockets even though the predicted vs. actual graphs look decent.

/Confused
(appreciate the feedback!)
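
One possible explanation for a high R² alongside a large RMSE, offered as a sketch rather than a diagnosis of this data: RMSE is expressed in the units of the target, while R² is scale-free. If the target is a large-valued field (the example below assumes an uptime stored in milliseconds, which is an assumption and not stated in the thread), the RMSE can look enormous even when the fit is good. All numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Share of the target's variance explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical target and predictions, expressed in hours.
actual_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_hours = [1.1, 1.9, 3.2, 3.9, 5.1]

# The same values expressed in milliseconds (assumed unit, for illustration).
MS_PER_HOUR = 3_600_000
actual_ms = [a * MS_PER_HOUR for a in actual_hours]
predicted_ms = [p * MS_PER_HOUR for p in predicted_hours]

r2_hours = r_squared(actual_hours, predicted_hours)
r2_ms = r_squared(actual_ms, predicted_ms)
rmse_hours = rmse(actual_hours, predicted_hours)
rmse_ms = rmse(actual_ms, predicted_ms)

# R² is identical in both units; RMSE grows by the unit conversion factor.
print(f"hours: R2={r2_hours:.3f} RMSE={rmse_hours:.3f}")
print(f"ms:    R2={r2_ms:.3f} RMSE={rmse_ms:.0f}")
```

Comparing the RMSE to the typical magnitude of the target (or rescaling the target first) gives a fairer sense of whether the error is actually large.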

SplunkTrust

Yeah, that is expected in real-life data. You will need to use multiple fields together to predict your dependent variable.

To know which fields to use, try the analyzefields search command. Identify the fields that have a good relationship with your dependent variable, then keep testing combinations until you get a desirable result.

http://docs.splunk.com/Documentation/Splunk/7.0.2/SearchReference/Analyzefields
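
As a rough illustration of screening candidate fields, the sketch below ranks predictors by the absolute Pearson correlation with the target. This is not a reimplementation of analyzefields, just a plain-Python stand-in for the general idea. The data is synthetic, and the field names and the `target` field are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-in data (assumption): one strong, one weaker, one unrelated field.
rng = random.Random(1)
n = 300
gc_threads = [rng.uniform(50, 400) for _ in range(n)]
target = [t + rng.gauss(0, 10) for t in gc_threads]
candidates = {
    "gc_threads": gc_threads,                                        # strongly related
    "gc_duration": [0.5 * t + rng.gauss(0, 100) for t in gc_threads],  # weaker link
    "cpu_time": [rng.uniform(0, 100) for _ in range(n)],            # unrelated noise
}

# Rank fields by the strength (absolute value) of their correlation with the target.
scores = {name: pearson(values, target) for name, values in candidates.items()}
ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

for name, corr in ranked:
    print(f"{name:12s} r={corr:+.3f}")
```

Pearson correlation only captures linear, one-field-at-a-time relationships, so a field that matters only in combination with others (as described earlier in this thread) can still score low here.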

Splunk Employee

Hi
In the end we had to fold as they moved the application to a newer host.
Another thing to consider is that it's very hard to determine the state of a Java application from the JMX data alone, as it gives you no real clue about how the application itself is doing, only the JVM.
It's a bit like looking at an ambulance and trying to determine if the patient inside is dead or not.
But thanks for the confirmation @skoelpin that we were at least going in the right direction.