Predictive analysis using linear regression and Kalman filter

New Member

Hi All,

I am trying to predict CPU utilization of servers using the Machine Learning Toolkit (MLTK) app for Splunk. While using this app, I found that the "Predict Numeric Fields" showcase using the Linear Regression algorithm predicted the given field very well, but it cannot be used to forecast future values of that field.
I tried merging the Splunk queries for linear regression and the Kalman filter to forecast the predicted field. Is this approach correct?
Below is the query I used; please let me know your thoughts and suggestions.
I am trying this because I am not sure about the prediction results of the Kalman filter alone.

index=main sourcetype=cpumetric metric_name=CPUUtilization   Environment="WEB"  Average>2.00 
| apply "Predict_CPUUtilization" 
| table _time, "Average", "predicted(Average)" | rename predicted(Average) as Avrg | timechart span=15m avg(Avrg) 
| predict "avg(Avrg)" as prediction algorithm="LLP5" future_timespan="3" holdback="0"  lower"50"=lower"50" upper"50"=upper"50" 
| `forecastviz(3, 0, "avg(Avrg)", 50)`

Champion

Hi,

What you have done stats-wise is OK, but I do have doubts about your initial regression model. However, if you feel it works and you have experimented with all the 'methods' under the Kalman filter, it is still possible. Remember, the Kalman prediction in this case is based on your linear regression and will be exactly as good or as bad as your linear model fit. Can you validate the prediction after applying the Kalman filter and see how it performs? Essentially, having R^2 = 1 does mean that you have explained all of the variation perfectly; however, it DOES NOT mean that the selected independent variables are the ONLY ones your dependent variable depends on. The proof of the pudding is in the eating, so you should validate your prediction against actuals. Please read the excerpt below on getting R^2 = 1. I still think your linear regression model might be missing something....

'An R2=1 indicates perfect fit. That is, you've explained all of the variance that there is to explain. In ordinary least squares (OLS) regression (the most typical type), your coefficients are already optimized to maximize the degree of model fit (R2) for your variables and all linear transforms of your variables. Your model appears to be a little odd in that x is being raised to a particular exponent, so your mileage may vary. But in response to your general question, you can always get R2=1 if you have a number of predicting variables equal to the number of observations, or if you've estimated an intercept the number of observations - 1. Either way, 20 parameters perfectly describes 20 data points. Such a model is called just-identified. Although this gives you the highly desirable perfect fit... it is essentially meaningless. '
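The "just-identified" situation from the excerpt is easy to reproduce. A minimal numpy sketch (the data here is made up, pure noise): with as many free parameters as observations, OLS fits perfectly even though there is no real relationship at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 observations, 5 free parameters (intercept + 4 random predictors):
# the system is "just-identified", so the fit is exact even though
# y is pure noise with no real relationship to X.
n = 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
y = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
r2 = 1 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()

print(round(r2, 10))  # 1.0 -- a "perfect" fit on random noise
```

Which is exactly why R^2 = 1 on its own proves nothing; the model has to be validated against held-out actuals.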

SplunkTrust

@sukisen1981 - Yeah, a perfect match makes me itch - it is almost always equivalent to "meaningless".

WAIT - @ankycampy - Can you please explain what Average>2 is doing in the initial search?

In my experience, that can't be right. Limiting records at that point, in that way, would result in an erroneous subsetting of the data. Basically, you eliminated the detail records with average below 2.0, so it would be impossible to correctly predict anything based on actual prior history... meaning that whatever you are doing isn't really prediction.
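To illustrate the subsetting problem (the numbers here are invented, not from the original search): filtering out rows with Average below 2.0 before modelling shifts every downstream statistic upward, so the model never sees what quiet periods look like.

```python
import numpy as np

# Hypothetical utilization series that regularly dips below 2%.
avg_util = np.array([0.5, 1.2, 3.0, 8.5, 1.8, 4.2, 0.9, 6.1])

full_mean = float(avg_util.mean())
filtered_mean = float(avg_util[avg_util > 2.0].mean())  # what Average>2.00 keeps

print(round(full_mean, 3))      # 3.275
print(round(filtered_mean, 3))  # 5.45 -- the quiet periods are gone
```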

Start over at basics.


Champion

@ankycampy - What @DalJeanis is saying is very important. If you build a linear regression model with R^2 as high as 1 or 0.98, it should be able to explain ALL of the variance in the dependent variable, so in your case, whether average CPU utilization is 0.5% or 50%, you should aim for a model that fits both of these values. It is better to have a model with a slightly lower R^2 value than to eliminate values just to get a high R^2 value.... But if you are satisfied with your model and it is predicting well (actuals against predicted), then you can of course disregard our 'expert' advice 🙂


Champion

Hmm, R^2 of 1? That means your linear regression model is able to account for 100% of the variance in the dependent variable? That is very rare; are you sure the model is right? What is the equation your linear regression model builds?
Also, how do the predicted vs. actual values compare when you use LLP? Is there a big difference?


New Member

Hi,
I might be doing many things wrong here, but I want to get it right with your help. Thanks for the responses.

I have removed Average>2 and rebuilt the model using linear regression. I am predicting the average CPU utilization ("Average" field) and using a few fields to help predict it: "Max", "Min", "Sum", and "_time", where Max, Min, and Sum are CPU values coming from the source.

index=main sourcetype=xyz metric_name=CPUUtilization | where EnvCategory="WEB"
| apply "Predict_CPUUtilization"

I am getting R^2 = 1 and RMSE = 0.
Please see the images below for reference:
https://s3.amazonaws.com/predimages/Fitting-model-snap-1.JPG
https://s3.amazonaws.com/predimages/Fitting-model-snap-2.JPG

Then I am adding a forecast of the predicted Average using the Kalman filter with the LLP5 method:

index=main sourcetype=xyz metric_name=CPUUtilization | where EnvCategory="WEB"
| apply "Predict_CPUUtilization"
| table _time, "Average", "predicted(Average)"
| rename predicted(Average) as Avrg | timechart span=5m avg(Avrg)
| predict "avg(Avrg)" as prediction algorithm="LLP5" future_timespan="3" holdback="0" lower"50"=lower"50" upper"50"=upper"50" | `forecastviz(3, 0, "avg(Avrg)", 50)`

I am getting R^2 = 1 and RMSE = 0.01.

Please see the images below for reference:
https://s3.amazonaws.com/predimages/Forecasting-1.JPG
https://s3.amazonaws.com/predimages/Forecasting-2.JPG

Please check and let me know what corrections are needed. Thanks!


Champion

Hmmm... I am getting very curious about this case. You are predicting average CPU utilization based on the max, min, and sum of the SAME CPU utilization values? That could be why you are getting an R^2 value of 1. In essence, you are regressing a dependent variable on independent variables that are derived from the dependent variable itself.....
OR
The max, min, and sum of CPU utilization are for some different system/application, which in turn is being used to predict the average CPU utilization of a new system/application. Even in this case, the relationship is so linear that you probably do not need a regression model at all....
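That circularity is easy to demonstrate outside Splunk. A numpy sketch with simulated (made-up) per-server utilization: since the average is exactly the sum divided by the server count, a linear model on the min/max/sum of the same readings must fit perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated: 100 intervals x 20 servers of per-server utilization.
util = rng.uniform(0, 100, size=(100, 20))

# Target and "features" are all summaries of the same numbers.
avg = util.mean(axis=1)
X = np.column_stack([np.ones(100), util.min(axis=1),
                     util.max(axis=1), util.sum(axis=1)])

beta, *_ = np.linalg.lstsq(X, avg, rcond=None)
pred = X @ beta
r2 = 1 - ((avg - pred) ** 2).sum() / ((avg - avg.mean()) ** 2).sum()

print(round(r2, 6))  # 1.0: avg is exactly sum / 20, so the fit is perfect
```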

What happens if you just use the LLP Kalman filter to predict the CPU utilization, forgetting the linear regression part altogether?

One note on any statistical model you use: the maths is only as right as the qualitative factors (the choice of predictor variables, in your case). There are a couple of CPU utilization regression models I am currently using, BUT the factors I have chosen are something like this:
cpu util = K + A(no. of log ins) + b(no. of concurrent log ins) + c (no.of stuck threads) + d(no.of calls to a particular link within the application that we knew was a major overhead)...
and I found, for my case, the random forest regressor to work best.
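A sketch of what fitting that kind of driver-based model looks like (everything here is simulated; the driver names and coefficients are invented for illustration, and ordinary least squares stands in for the random forest):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical workload drivers (all names and numbers invented).
logins = rng.poisson(50, n)
concurrent = rng.poisson(20, n)
stuck_threads = rng.poisson(2, n)
heavy_calls = rng.poisson(10, n)

# Simulated "true" relationship plus noise:
# cpu = K + a*logins + b*concurrent + c*stuck_threads + d*heavy_calls
cpu = (5.0 + 0.3 * logins + 0.8 * concurrent
       + 4.0 * stuck_threads + 1.5 * heavy_calls + rng.normal(0, 2, n))

X = np.column_stack([np.ones(n), logins, concurrent,
                     stuck_threads, heavy_calls])
beta, *_ = np.linalg.lstsq(X, cpu, rcond=None)

print(np.round(beta, 1))  # close to [5.0, 0.3, 0.8, 4.0, 1.5]
```

In the MLTK, the same idea is a `fit LinearRegression` (or `fit RandomForestRegressor`) over fields carrying those counts.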


New Member

Hi sukisen,

When you say you are predicting CPU utilization using the factors below:

"the factors I have chosen is something like this :
cpu util = K + A(no. of log ins) + b(no. of concurrent log ins) + c (no.of stuck threads) + d(no.of calls to a particular link within the application that we knew was a major overhead)..."

Are you getting these results using the random forest regressor, and are you forecasting the CPU utilization as well?
If yes, have you used the random forest regressor and the Kalman filter together?

My main goal is to forecast the CPU utilization based on factors similar to the ones you used, but I am not sure how to forecast the predicted results.
I hope to get some help on this from you.


New Member

Hi,
Yes, the max, min, sum, and average are from different servers. I am collecting data for the complete web layer of our application (assume 20-30 web servers) and trying to predict the average CPU utilization for the whole web layer.

Using only the Kalman filter:

index=main sourcetype=xyz metric_name=CPUUtilization EnvCategory="WEB"
| table _time, "Average" | timechart span=15m avg(Average) | predict "avg(Average)" as prediction algorithm="LLP5" future_timespan="3" holdback="0" lower"50"=lower"50" upper"50"=upper"50" | `forecastviz(3, 0, "avg(Average)", 50)`

Using both linear regression and the Kalman filter together:

index=main sourcetype=xyz metric_name=CPUUtilization EnvCategory="WEB" | apply "Predict_CPUUtilization"
| table _time, "Average", "predicted(Average)" | rename predicted(Average) as Avrg | timechart span=15m avg(Avrg) | predict "avg(Avrg)" as prediction algorithm="LLP5" future_timespan="3" holdback="0" lower"50"=lower"50" upper"50"=upper"50" | `forecastviz(3, 0, "avg(Avrg)", 50)`

I am getting exactly the same forecast in both of the scenarios above.

Hmm, does that mean I am only getting the Kalman filter's results in both cases?
I will try to find suitable predictor variables for the web layer and use them to see what difference that makes.


Champion

Hmmm. Well, it is not that the Kalman filter applied over your linear regression is the only thing working. What it means is that your predictor variables are not really independent, in the sense that you are trying to predict something like avg utilization(A+B+C) = k + n*max(A+B+C) + m*min(A+B+C)... not exactly, but I hope you get the drift.
Hence it does not matter what the linear regression model predicts; the Kalman filter LLP alone is enough, and one is as good as the other. I do believe you need some other predictor variables from the web layer, as you say, because at the moment your model, though mathematically correct, is merely a linear slope fitted over predictors that go directly (by summation) into the target variable. That is the reason why you are getting R^2 = 1. You can still use your model; it is not wrong. But think about it: do we really need a model to tell us that average CPU utilization will dip by a factor of X if one of the constituent CPUs dips? Wouldn't it be better if we could say something like 'if log-ins on 3 of the 20-30 web servers increase, THEN the total average CPU utilization increases by a factor of Y'?

One more test you can run: instead of using min, max, and sum, just take the average CPU utilization of each of the 20-30 web servers as the predictor variables and predict the overall average CPU utilization. It might give you very similar results, OR it might reveal a trend, e.g. that an increase in the average utilization of, say, servers 13, 11, and 5 increases the overall CPU utilization more than you would expect from a simple linear or logarithmic dependency. The coefficients for the servers will reveal the extent to which each one affects the overall CPU utilization. This is a fascinating case and I am looking forward to your response. Happy ML!!
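The coefficient-inspection idea can be sketched like this (simulated data; the weights are invented to play the role of servers that matter more than others):

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_srv = 200, 5  # 5 servers to keep the printout readable

srv = rng.uniform(0, 100, size=(n_obs, n_srv))

# Hypothetical: servers 0 and 3 drive the overall figure more heavily.
weights = np.array([0.50, 0.10, 0.10, 0.25, 0.05])
overall = srv @ weights + rng.normal(0, 1, n_obs)

X = np.column_stack([np.ones(n_obs), srv])
beta, *_ = np.linalg.lstsq(X, overall, rcond=None)

# The fitted coefficients recover each server's influence.
print(np.round(beta[1:], 2))  # close to [0.5, 0.1, 0.1, 0.25, 0.05]
```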


Champion

Hi,

A couple of things here.
If the regression equation you get has R^2 ~ 98%, it is a good fit.
I am assuming all the predictor variables can be obtained from the index (and the actual CPU utilization is missing from the Splunk index); you can then use the regression equation generated by your linear regression algorithm to predict the CPU utilization.

Regarding the Kalman filter, it is important to realize that it is a time series predictor. There is a 'method' dropdown under the Kalman filter; you can try the various methods and see which predicts better, a lot of which can be done WITHOUT the Machine Learning Toolkit - http://docs.splunk.com/Documentation/SplunkCloud/6.5.1/SearchReference/Predict
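For intuition only: Splunk's LL/LLP algorithms are local-level (and seasonal local-level) state-space models. A toy one-dimensional local-level Kalman filter, not Splunk's actual implementation, looks roughly like this:

```python
import numpy as np

def local_level_forecast(y, q=1.0, r=4.0, horizon=3):
    """Toy local-level Kalman filter: track a slowly drifting level.
    q = process noise variance, r = observation noise variance."""
    level, p = float(y[0]), 1.0
    for obs in y[1:]:
        p += q                      # predict: uncertainty grows
        k = p / (p + r)             # Kalman gain
        level += k * (obs - level)  # update toward the new observation
        p *= 1 - k
    # A pure local-level model forecasts a flat line at the last level.
    return np.full(horizon, level)

series = np.array([10.0, 10.5, 11.0, 10.8, 11.2, 11.5])
print(local_level_forecast(series))  # three equal values near 11
```

LLP adds a periodic (seasonal) component on top of this level, which is why it needs enough history to detect the period.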

My suggestion (and just a suggestion) is to use your regression model, which as you say returns good results, rather than the Kalman time series predictor. Remember, the time series predictor mostly just projects future values from historical trends of the same value. It is great if you have seasonal variations etc., which are the most important factor in that kind of prediction, BUT if you want to make a prediction based on explanatory variables (like CPU utilization based on no. of log-ins etc.), it is better to go for a regression model.
Have you tried the decision tree regressor and the random forest regressor in the ML toolkit? They might actually give even better results than the linear regression.


New Member

Hi,

My linear regression model's R^2 value is coming out as 1, which means it is predicting the field perfectly. Now, to predict it for the next 5 or 10 minutes, what do I need to do?

As mentioned in the MLTK, linear regression is used to detect anomalies when there is a big difference between the original field value and the predicted field value.
I am not getting anomalies; I am getting 100% matching predicted results. Now, to predict the field for the next 5 or 10 minutes, do I need to use the predict command (which uses the Kalman filter) in the backend?

In my original query, "predicted(Average)" is the output field of the linear regression containing the predicted values. To forecast it for future time, I am transforming it into time series data and then forecasting with the Kalman filter. Is this correct, or can I predict the "predicted(Average)" values for future time without using the Kalman filter?

Linear regression query using the prediction model:
index=main sourcetype=cpumetric metric_name=CPUUtilization Environment="WEB" Average>2.00
| apply "Predict_CPUUtilization"
| table _time, "Average", "predicted(Average)"

Added the query below to get the forecast:
| rename predicted(Average) as Avrg | timechart span=15m avg(Avrg)
| predict "avg(Avrg)" as prediction algorithm="LLP5" future_timespan="3" holdback="0" lower"50"=lower"50" upper"50"=upper"50"
| `forecastviz(3, 0, "avg(Avrg)", 50)`


@Sukisen1981

I am also facing the same issue: the Kalman filter's predictions are not in line with the actual values. Linear regression is the best fit for predicting numeric values for CPU, memory, and disk, but we can't see a forecast for the next 30 days with this model. How do I proceed? Any thoughts on that?


SplunkTrust

Not an answer, but you can move the where conditions into the base search like so:

 index=main sourcetype=cpumetric metric_name=CPUUtilization Environment="WEB" Average>2.00 

cheers, MuS


New Member

Hi, Corrected
