I'm a newbie in the ML world, with some academic background, but very limited practice...
We're now working on a prediction model, trying to forecast ICT incidents based on some input variables, with very limited data points (basically one monthly value for each of the last 12 months).
We approached the problem in a very "old-fashioned" way, assigning weights to each independent variable on a "judgmental" basis to try to achieve a sound forecast.
I'm now trying to back-test the model with Splunk ML (using the "Predict Numeric Fields" option). I have a big question about how the Splunk Machine Learning Toolkit works. Basically, with the same dataset and the same prediction parameters ("Field to predict", "Predictors" and Training/Test ratio), each time I press the "Fit Model" button I get a different outcome.
My initial understanding was that pressing "Fit Model", SML ran multiple iterations on the same dataset (with a different sample of the same dataset each time) to finally provide a statistically sound result. But, in reality, it seems to me that SMLT just runs a single iteration on a single sample extracted from the input dataset and returns the result. Is my understanding correct?
Is there any way to run multiple iterations on the same dataset (different samples) to finally get the best-fit model?
Thanks for the support!
If your base search at the top of the assistant returns a static set of events, then the only source of run-to-run variation is the random split between training data and test data that the assistant does automatically (governed by the "Split for training / test" control). The assistant doesn't let you fix the random seed, but you can do this manually using the 'sample' command's 'partition' mode, plus a 'where' clause to select either the training partition or the test partition:
[your static base search] | sample partitions=2 seed=47 | where partition_number=0 | fit ...

[your static base search] | sample partitions=2 seed=47 | where partition_number=1 | apply ...
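In case it helps to see the idea outside of SPL, here's a rough Python sketch of what a seeded partition does. This is just an analogy using Python's `random` module, not how Splunk implements 'sample' internally:

```python
import random

def partition(events, partitions=2, seed=47):
    # Assign each event a pseudo-random partition number; because the
    # seed is fixed, the assignment is identical on every run.
    rng = random.Random(seed)
    return [(rng.randrange(partitions), e) for e in events]

events = list(range(12))          # stand-in for 12 monthly datapoints
run1 = partition(events)
run2 = partition(events)
assert run1 == run2               # same seed -> same split, every time

train = [e for p, e in run1 if p == 0]   # like `where partition_number=0`
test  = [e for p, e in run1 if p == 1]   # like `where partition_number=1`
```

With the seed pinned, the training rows and test rows are the same on every run, so the fitted model stops changing between runs.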
I hope that helps!
Thanks for the feedback, so you confirm my understanding...
The problem is that each "Fit Model" run gives me a different predictor model.
How can I "null out" these differences (a kind of best fitting)? Is it feasible/sound in your opinion?
Or, since I have very limited data points (only 12 occurrences), can I not expect to have a reliable model?
Thanks for the help!
If the events you use for training are the same, most algorithms will deterministically give you the same model. One obvious exception is RandomForest*.
| sample partitions=2 seed=47 | where partition_number=0 | fit LinearRegression ...
Twelve examples is not very many, especially if you're only using ~six to train and ~six to test. I wouldn't expect most algorithms to come up with a particularly good model, unless the pattern in the data is dead simple.
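To make the determinism point concrete, here's a tiny pure-Python sketch (the numbers are invented for illustration). Ordinary least squares has a closed-form solution, so the same six training rows always yield exactly the same line, while bagging-style algorithms like RandomForest resample the training data and only become reproducible once you pin the seed:

```python
import random

def ols(xs, ys):
    # Closed-form simple linear regression: no randomness anywhere,
    # so identical inputs always give identical coefficients.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5, 6]                 # six training months (invented)
ys = [3.1, 4.9, 7.2, 9.0, 11.1, 12.8]   # incident counts (invented)

assert ols(xs, ys) == ols(xs, ys)       # deterministic: same model every fit

# RandomForest-style bootstrap resampling, by contrast, varies from run
# to run unless the random seed is pinned:
assert (random.Random(42).choices(xs, k=6)
        == random.Random(42).choices(xs, k=6))
```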
Mmm, I agree with you... few data points and the relationship is not really clear (I was trying to apply ML to come up with some statistically proven weights to apply to the model we created in a standard way).