Solved: Predict - 95% Confidence Interval

arielpconsolaci · 03-28-2017

I have read in the Splunk docs that the predict command defaults its lower and upper confidence interval to 95%. I am trying to understand how to interpret this (i.e. what the confidence interval means for my dashboard). Could you please advise why it defaults to 95% rather than some other percentage, and, if possible, how it is computed?

Responses are highly appreciated. Thank you!

Richfez · 03-28-2017

Hi, arielpconsolacion.

This is all about statistics. For normally distributed data, 95% corresponds to roughly two standard deviations either side of the mean (1.96, to be precise), a very common "threshold" used in statistics and the sciences for "very likely."

The beginning and the "simple examples" section of the Wikipedia article on standard deviation help a little. There's a linked article on the 68-95-99.7 rule which also has some decent information in it.
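For anyone who wants to see where the 68-95-99.7 numbers come from, here is a minimal sketch using only Python's standard library. It assumes the values follow a normal distribution, which is the assumption behind the rule itself; the function name `coverage_within` is just an illustrative choice.

```python
from statistics import NormalDist


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean (assumes normality)."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Coverage for 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")

# Multiplier that gives exactly 95% two-sided coverage
z_95 = NormalDist().inv_cdf(0.975)
print(f"multiplier for exactly 95%: {z_95:.3f}")
```

Two standard deviations actually cover a little more than 95%, and the multiplier for exactly 95% is a little under 2, which is why "two standard deviations" is a convenient rounding rather than an exact figure.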

A simplified and vague example may help.

Let's assume you are counting network events. (The actual thing you are counting or measuring doesn't really matter.) If over the past 6 hours you have had a pretty consistent measurement, let's say right around 100 events per minute (like 98, 102, 100, 101, 99, 106, 100, 99, 97, 98 ... essentially hovering close to 100), then you have a couple of things you can say.

You could say the most likely individual measurement you would expect 5 minutes from now is 100. But you would be hard pressed to say it WILL be 100, right? Just that it's likely close.

Well, if you did all the math to find your standard deviation (I don't suggest doing it by hand except in the simplest of examples), and you found that standard deviation to be 5, then you could say that 5 minutes from now your measurement will likely be between 95 and 105 (100 +/- 5). That's your one standard deviation, where about 68% of the time you'd expect the actual value you measure (when you get there 5 minutes later) to be within your "prediction".
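If you'd rather not do that math by hand, here is a small sketch that does it for the ten sample counts listed earlier. It is only an illustration of the simplified picture in this post, not of how predict works internally. Note that these particular ten values give a standard deviation smaller than the 5 assumed in the example above; the 5 is just a convenient round number.

```python
from statistics import fmean, stdev

# The ten per-minute counts used as the example in this post
counts = [98, 102, 100, 101, 99, 106, 100, 99, 97, 98]

center = fmean(counts)   # arithmetic mean
spread = stdev(counts)   # sample standard deviation (n - 1 in the denominator)

# One-standard-deviation band around the mean
low, high = center - spread, center + spread

# How many of the observed counts fall inside that band
inside = sum(1 for c in counts if low <= c <= high)

print(f"mean = {center:.2f}, standard deviation = {spread:.2f}")
print(f"one-sigma band: {low:.2f} to {high:.2f}")
print(f"{inside} of {len(counts)} observed counts fall inside the band")
```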

Now, here's the key one: if you wanted a range for your measurement in 5 minutes that should cover the actual value 95 times out of 100, you roughly double your standard deviation (the exact multiplier is 1.96) and you get the "default" 95% threshold. So, in our example, 95% of the time you'd expect the actual measurement in 5 minutes to be between about 90 and 110.
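As a rough sketch of that calculation, the snippet below turns a center and a standard deviation into a band for any confidence level. It assumes normally distributed values, and `prediction_band` is a hypothetical helper name for this illustration, not anything Splunk provides. The predict command derives its bounds from its own forecasting model, so treat this as the intuition rather than the exact computation.

```python
from statistics import NormalDist


def prediction_band(center: float, sd: float, confidence: float = 0.95):
    """Return (low, high) such that a normally distributed value with the
    given mean and standard deviation falls inside with the given probability."""
    tail = (1.0 - confidence) / 2.0        # probability left in each tail
    z = NormalDist().inv_cdf(1.0 - tail)   # about 1.96 for 95%
    return center - z * sd, center + z * sd


# The example from this post: mean 100, standard deviation 5
low95, high95 = prediction_band(100.0, 5.0)
low68, high68 = prediction_band(100.0, 5.0, confidence=0.68)

print(f"95% band: {low95:.1f} to {high95:.1f}")
print(f"68% band: {low68:.1f} to {high68:.1f}")
```

Asking for more confidence always widens the band: the price of being right more often is a less specific prediction.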

Here's the thing about the standard deviation in this example. We pretended to use pretty consistent initial numbers, so our standard deviation is pretty low. If instead you have more varying initial numbers, like 102, 156, 99, 64, 45, 150, 155, 100 ..., then perhaps our "average" is still pretty close to 100 (I don't know, I didn't actually do the math!). But the standard deviation will be far greater, perhaps +/- 50, which means your predictions will be all over the place. In that case, the "range" of actual values the count could take on 5 minutes from now is BIG. It's hard to predict. You'll still get a prediction, but it's more vague and has less "predictive value" because of all that variation, so the range will be bigger.
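To put numbers on that comparison, here is a sketch that runs the same calculation on both sets of values quoted in this post. It uses only the values actually listed (the noisy list trails off, so this is a partial sample) and a plain "mean +/- 2 standard deviations" band, which is the simplified picture used here rather than predict's own algorithm.

```python
from statistics import fmean, stdev

steady = [98, 102, 100, 101, 99, 106, 100, 99, 97, 98]
noisy = [102, 156, 99, 64, 45, 150, 155, 100]


def summarize(values):
    """Return (mean, sample standard deviation, (low, high)) where the
    band is mean +/- 2 standard deviations, i.e. roughly 95%."""
    center = fmean(values)
    spread = stdev(values)
    return center, spread, (center - 2 * spread, center + 2 * spread)


steady_center, steady_spread, steady_band = summarize(steady)
noisy_center, noisy_spread, noisy_band = summarize(noisy)

for name, (center, spread, band) in (
    ("steady", (steady_center, steady_spread, steady_band)),
    ("noisy", (noisy_center, noisy_spread, noisy_band)),
):
    print(f"{name}: mean = {center:.1f}, sd = {spread:.1f}, "
          f"~95% band = {band[0]:.1f} to {band[1]:.1f}")
```

For the values listed, the two averages are fairly close to each other, but the noisy series has a standard deviation more than ten times larger, so its band is correspondingly wider.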

Here's an example of data that's pretty predictable:

You'll see that it's consistent and there's quite a bit of it. The more data points you have the better your prediction capability, right?

Here's an example of data that's so sparse it's hard to predict.

There's just so little data that the "predict" command can't do anything accurate with it. That's similar to having data that is so scattered that it's hard to predict.

In general, if you don't have 30+ data points your predictions will be poor. If you have high variation (that doesn't have a nice, easily discerned repeating pattern), your predictions will also be poor. Like the second example.

If instead you have a fairly large amount of consistent (or consistently changing) data, the predictions can be pretty tight. Like the first example.
