I am trying to train a clustering model but keep running into a memory limit error because the data set is large. I would like to use event sampling, but I don't know the command for it.
How do I set a sampling ratio for the initial search in Splunk MLTK? Is there a specific SPL command for that?
How big is your sample data? Do you need to train on a sample this large? Why not train on a smaller sample set, as long as it represents the underlying data well?
Are you sure you're bumping into limits, as opposed to actually running out of memory?
Skoelpin, I have 500k observations. I want to limit the training to a smaller set because I am just using an MLTK sandbox to judge whether MLTK is the right solution for us before configuring it in PROD.
Let me know if you have a solution. Thanks!
The point I'm trying to make is: why sample from a larger data set when you can just reduce the size of the training data set?
Lastly, the MLTK is a collection of libraries imported into Splunk. It will work if you give it the right data and ask the right questions.
When we just reduce the size of the training data set, it doesn't randomly select the observations (rows/events). As a result, the data may not closely represent the whole population.
If we use sampling, the events are randomly selected, so the sample is more representative of the full data set.
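To illustrate the difference (a sketch; the index name is a placeholder): truncating with head keeps only the first events the search returns, while a modulo test on random() draws events across the whole result set:

| search index=my_index | head 100000
| search index=my_index | where (random() % 5) = 0

The first keeps the earliest 100k events; the second keeps roughly every fifth event at random (about 20%).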
You are right, I am bumping into limits. I have already requested an increase. In the meantime, I wanted to learn how to sample with SPL to serve the immediate need.
You can use the event sampling selector above the search bar to accomplish this. You can also sample with SPL techniques such as:

| eval samplingperc=20
| eval modulo=ceil(100/samplingperc)
| where (random() % modulo) = 0

which keeps roughly 20% of the events: random() returns a pseudo-random non-negative integer, so about one in every modulo events passes the where clause.
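Putting it together with the MLTK fit command (a sketch; the index, the feature field names, and the k value are all placeholders you would replace with your own):

| search index=my_index
| where (random() % 5) = 0
| fit KMeans feature1 feature2 k=4

This trains KMeans on roughly a 20% random sample of the events, which should keep you under the input limits while remaining representative.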
Lastly, you can control these limits in the MLTK UI directly under the Settings tab in the nav bar. If this answered your question, please accept it as the answer.
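If you have filesystem access to the sandbox, the same limits can be set in mlspl.conf (a sketch; the stanza and setting names assume the MLTK app's mlspl.conf defaults, and the values here are purely illustrative):

[default]
max_inputs = 500000
max_memory_usage_mb = 2048

Raising max_inputs lets fit accept more events before it refuses or downsamples, at the cost of more memory during training.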
Thanks a lot.
These commands don't seem to work. Are there any limitations?