
Data to Train Machine Learning

Engager

Hey Guys,

Hope you can give insights on this.

Currently, we are using Machine Learning (ML) to predict a ticket's type/category. We are using the whole index as the input to train the model.

My question: is it right to use the whole index as the input? Or does ML just need a new set of data/events to train the model?

0 Karma

Re: Data to Train Machine Learning

SplunkTrust

As a general case, NO. You should reserve at least half the data for testing the results of the training. Otherwise, how will you validate that the result is reasonable?

Second, "predicting new ticket type/category" is pretty vague. What is the research question? What are you looking to achieve by having this new category? What kind of tickets are we talking about - airplane tickets, trouble tickets, concert tickets, sports tickets?
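
To make the validation part concrete, here is a minimal sketch of how you might check a trained model against data you held out of training. The model name (ticket_model), the label field (category), and the predicted field are all placeholder names, not anything from your environment:

| apply ticket_model as predicted_category | eval correct=if(predicted_category==category, 1, 0) | stats avg(correct) as accuracy count as tested_events

Run that only over the held-out events; if the accuracy is still reasonable on data the model never saw, you have some evidence it is not just memorizing the training set.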

0 Karma

Re: Data to Train Machine Learning

Engager

Thanks for the response @DalJeanis!

By ticket, I mean incidents logged by the users. We are using ML to auto-categorize each logged incident as an 'Admin Request', 'Change Request', etc.
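
To give a concrete picture, a simplified sketch of the kind of search involved would be something like the following; the field names (assignment_group, urgency, impact) and the model name are placeholders, not our actual setup:

| fit LogisticRegression category from assignment_group urgency impact into incident_category_model

and then for newly logged incidents:

| apply incident_category_model as predicted_category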

For the data, what if a new set of data is ingested into Splunk and is also auto-categorized? Is it advisable to use that as training data for ML?

0 Karma

Re: Data to Train Machine Learning

New Member

A common split between training and test data is 70/30, i.e. 70% of the data for training and 30% for testing.
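
If you just want a rough 70/30 split in plain SPL, one way to sketch it is to bucket events with eval's random() function (split_bucket is just a made-up field name):

| eval split_bucket = random() % 10 | where split_bucket < 7

That keeps roughly 70% of the events, and split_bucket > 6 would give the remaining 30%. The catch is that random() changes on every search run, so the two sides won't line up across separate searches.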

0 Karma

Re: Data to Train Machine Learning

New Member

Do you know the SPL command to split data into training and testing sets? I didn't see it in the docs. Thanks

0 Karma

Re: Data to Train Machine Learning

Contributor

It is not the clearest thing in the docs, but you use the sample command that comes with MLTK, specifically the partitions option (set to 10 is usually what you want), and then you search on partition_number < X. For a 70/30 split that would be less than 7, since partition numbering starts at 0. Make sure to use the seed option so you can come back later and search partition_number > X-1 to get the other side of the split.
Training set:
| sample partitions=10 seed=1234 | search partition_number < 7 | fit MLAlgoName targetfield from whateverfields into savedmodelname
Test set:
| sample partitions=10 seed=1234 | search partition_number > 6 | apply savedmodelname as predicted_field
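
On top of the test-set search above, a useful sanity check (plain SPL, nothing MLTK-specific) is to compare the actual target against the prediction per category:

| sample partitions=10 seed=1234 | search partition_number > 6 | apply savedmodelname as predicted_field | stats count by targetfield predicted_field

The rows where targetfield and predicted_field disagree show which categories the model is mixing up, which is often more informative than a single accuracy number.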