Solved: Re: splunk 6.1.2 + predict Questions/Clarification...

HattrickNZ · ‎03-05-2015

I am trying to get a better understanding of the predict command in Splunk 6.1.2.

I have the below search
... | predict SGSN02KPR as predict1 future_timespan=10 holdback=0

Questions:

What default algorithm is used here? Is it LL, LLP, LLT, LLB, or LLP5?
What does the holdback argument do? The documentation says it "Specifies the number of data points from the end that are NOT used to build the model." So if I had 10 data points, 1-10, would holdback=3 build the model using data points 8, 9, and 10? Or would it use data points 4-10, or something else?
Also, in the image below, the blue line is my actual data and the yellow line is the prediction. Why is the yellow line overlapping the blue line? And can I get it to start from where the blue line finishes?

Clarifications:

future_timespan=10 predicts 10 days into the future, as I have span=d in the timechart part of my query.

tlagatta_splunk · ‎03-09-2015

Hi @HattrickNZ.

The default algorithm is LLP5 ("Uses the sum of the LLT and LLP models for its combined prediction.").
The holdback argument specifies the number of data points from the end that are NOT used to build the model. In your example, holdback=3 means "build a model from points 1-7 and predict the values for points 8-10". This is good for testing and validating the predict command.
The predict command uses a Kalman filter to make its prediction, which incorporates a noisy model for the real world. The yellow line is the "best guess" of the "true state" of the world. Since Splunk does a good job capturing your data, you should expect the blue timechart and the yellow best guess to line up pretty closely (as they do in your image). If they don't line up, then don't trust the prediction (you can add more data, choose a finer span, etc.).
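
As a rough illustration of the holdback split described above, here is a minimal Python sketch. It is not Splunk code; the function name split_for_holdback is made up for this example, and it only shows which points would be used for training and which would be held out.

```python
def split_for_holdback(points, holdback):
    """Split a series into (training, held_out) the way holdback is described:
    the last `holdback` points are NOT used to build the model.
    Hypothetical helper for illustration only, not part of Splunk."""
    if holdback < 0 or holdback > len(points):
        raise ValueError("holdback must be between 0 and len(points)")
    cut = len(points) - holdback
    return points[:cut], points[cut:]


points = list(range(1, 11))  # data points 1-10, as in the question
training, held_out = split_for_holdback(points, holdback=3)
print(training)  # [1, 2, 3, 4, 5, 6, 7]
print(held_out)  # [8, 9, 10]
```

With holdback=0 the held-out list is empty and every point goes into training.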

Hope this helps.

HattrickNZ · ‎03-10-2015

tks @tlagatta_splunk very helpful

default algorithm is LLP5...oops should have read that from the docs
In my example I used holdback=0 to use all values in my model, but it does not seem to do that. It always seems to predict for values I already have, hence the yellow line overlapping the blue line above.
How would I do just a simple linear forecast?

Other observations

in your search the future timespans need to line up ... latest=+100d@d ... future_timespan=100
holdback also has something to do with the last date in your timespan e.g. if you use future_timespan=100 the last date will be 100 minus the holdback value....
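
The second observation can be written out as simple arithmetic. This Python sketch assumes, as the observation suggests, that the forecast horizon is counted from the last point used for training, so held-back points use up part of future_timespan. The function name is hypothetical.

```python
def points_beyond_data(future_timespan, holdback):
    """Number of predicted points that land after the last observed point,
    ASSUMING the horizon starts right after the last training point.
    Hypothetical helper for illustration only."""
    return max(future_timespan - holdback, 0)


print(points_beyond_data(future_timespan=100, holdback=0))   # 100
print(points_beyond_data(future_timespan=100, holdback=10))  # 90
```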

tlagatta_splunk · ‎03-10-2015

Hi @HattrickNZ, glad it helped.

"it always seems to predict for values I already have"

This is a feature, not a bug! You should always predict the past values, to calibrate the prediction and make sure it's doing what you expect it to do. In many cases, the first attempt will do a poor job of predicting the past, which means you have to tweak it to make things work (e.g., add more historical data or make the timespan finer, like change span=1mon to span=1w). If you only predict the future, you won't know if the prediction is bad or not until you have to make decisions on it, which is usually too late.
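
To make the calibration idea concrete, here is a rough Python sketch of checking a model against the past. It uses a scalar local-level Kalman filter, which is only a stand-in for whatever predict does internally; the noise settings process_var and obs_var are arbitrary assumptions, and all names are invented for this example.

```python
def local_level_filter(observations, process_var=1.0, obs_var=1.0):
    """One-step-ahead predictions from a scalar local-level Kalman filter.
    process_var and obs_var are assumed values, not anything Splunk exposes."""
    level = observations[0]  # initial state estimate (assumption)
    variance = obs_var       # initial uncertainty (assumption)
    predictions = []
    for y in observations:
        # Predict step: the level carries over, uncertainty grows.
        variance += process_var
        predictions.append(level)
        # Update step: pull the estimate toward the new observation.
        gain = variance / (variance + obs_var)
        level += gain * (y - level)
        variance *= (1 - gain)
    return predictions, level


def mean_absolute_error(actual, predicted):
    """Average absolute gap between the actual and predicted past values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


history = [10.0, 11.0, 12.5, 12.0, 13.0, 14.5]  # made-up daily values
past_predictions, last_level = local_level_filter(history)
print(mean_absolute_error(history, past_predictions))
```

A large error on the past values is the signal to add more history or choose a finer span before trusting the future part of the curve.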

"how would i do justa simple linear forecast?"

Unfortunately, simple linear regressions are not implemented in the core product right now. If you're looking for just linear trendlines, this community-wiki post on plotting a linear trendline might help. Keep in mind that the predict command implements a Kalman filter, so it's a pretty robust way to make temporal predictions.
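
For readers who want to see what a simple linear forecast involves, here is a minimal Python sketch of an ordinary least-squares trendline extrapolated forward. It is not the method from the wiki post and not Splunk functionality; the function names are hypothetical, and the data points are assumed to be evenly spaced (one per time span).

```python
def fit_line(values):
    """Least-squares slope and intercept for evenly spaced points x = 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def linear_forecast(values, future_timespan):
    """Extend the fitted line by future_timespan points past the data."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * x for x in range(n, n + future_timespan)]


print(linear_forecast([10, 12, 14, 16], future_timespan=3))  # [18.0, 20.0, 22.0]
```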

HattrickNZ · ‎03-10-2015

tks again @tlagatta_splunk

So can I control how many past values it will predict for calibration? Is there a minimum default setting for the number of past values it will predict for calibration? And is this the holdback or something else?

From a visual point of view it would be good to be able to do the calibration and then have the option to remove it also. but hey 🙂

tlagatta_splunk · ‎03-10-2015

"so can i control how many past values it will predict for calibration? Is there a min default setting of the number of past values it will predict for calibration?"

By default, the predict command uses all past values to build a model of the timeseries (incl. best-fit curve and uncertainty envelope). The holdback argument allows you to leave recent points out of the training process.

If you have enough data points (1 time span = 1 data point), then the best-fit curve and uncertainty envelope should both track closely to the past data. If not, then add more historical data or choose a finer span.

"From a visual point of view it would be good to be able to do the calibration and then have the option to remove it also. but hey :)"

I do not advise removing this, even for visualization purposes. If something in your data changes and the prediction loses its accuracy (e.g., some rare event occurs and severely changes the model), then you want to see that immediately. When you use the predict command to make decisions, you should do so based on both the past & future trendlines, rather than a mix of the raw data & the future trendline alone.

In terms of options, you can always use the search language to further manipulate the data. The following query will remove the prediction from rows where the count field is non-null. I can't prevent you from doing this, but I do strongly advise you against it 🙂

| foreach prediction [eval <<FIELD>>=if(isnotnull(count), null(), '<<FIELD>>')]
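
The effect of that query can be sketched in Python on a few made-up rows. The field names count and prediction follow the query above (in the original search they would be the KPI field and predict1), and hide_calibration is a hypothetical helper name.

```python
def hide_calibration(rows, actual="count", predicted="prediction"):
    """Blank the predicted value on every row that has an actual value,
    leaving predictions only where there is no real data (the future)."""
    return [
        {**row, predicted: None if row.get(actual) is not None else row.get(predicted)}
        for row in rows
    ]


rows = [
    {"count": 5, "prediction": 5.2},     # past: has real data
    {"count": 7, "prediction": 6.8},     # past: has real data
    {"count": None, "prediction": 7.5},  # future: prediction only
]
for row in hide_calibration(rows):
    print(row)
```

Only the last row keeps its prediction, which is exactly why the calibration check disappears from the chart.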

HattrickNZ · ‎05-07-2015

@tlagatta_splunk thanks very much for your help on this....

How do I exclude today's value from the real values? I am working with max values per day, and if I run this search in the morning, the max for today won't be reached until later in the day, so I would like to leave today's value out. In fact, I can do this using holdback=1, but that won't stop it showing in the graph. Is there a way to remove it?

My search looks something like:
...earliest=-120d@d latest=+300d@d | timechart span=d max(KPI1) by DeviceName | predict Device1 as predict1 future_timespan=300 holdback=10
