
Data to Train Machine Learning

Engager

Hey Guys,

Hope you can give insights on this.

Currently, we are using Machine Learning (ML) to predict a ticket's type/category. We are using the whole index as the input to train the model.

My question: is it right to use the whole index as the input? Or does ML just need a new set of data/events to train the model?

0 Karma

Re: Data to Train Machine Learning

SplunkTrust

As a general case, NO. You should reserve at least half the data for testing the results of the training. Otherwise, how will you validate that the result is reasonable?

Second, "predicting new ticket type/category" is pretty vague. What is the research question? What are you looking to achieve by having this new category? What kind of tickets are we talking about - airplane tickets, trouble tickets, concert tickets, sports tickets?
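
To make the validation part concrete, here is a minimal sketch of how you might check a trained model against data you held out of training. The model name (ticket_model), the label field (category), and the predicted field are all placeholder names, not anything from your environment:

| apply ticket_model as predicted_category | eval correct=if(predicted_category==category, 1, 0) | stats avg(correct) as accuracy count as tested_events

Run that only over the held-out events; if the accuracy is still reasonable on data the model never saw, you have some evidence it is not just memorizing the training set.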

0 Karma

Re: Data to Train Machine Learning

Engager

Thanks for the response @DalJeanis!

By ticket, I mean incidents logged by the users. We are using ML to auto-categorize each logged incident as an 'Admin Request', 'Change Request', etc.
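
To give a concrete picture, a simplified sketch of the kind of search involved would be something like the following; the field names (assignment_group, urgency, impact) and the model name are placeholders, not our actual setup:

| fit LogisticRegression category from assignment_group urgency impact into incident_category_model

and then for newly logged incidents:

| apply incident_category_model as predicted_category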

For the data, what if a new set of data is ingested into Splunk and is also auto-categorized? Is it advisable to use that as training data for ML?

0 Karma

Re: Data to Train Machine Learning

New Member

A common split between training and test data is 70/30, i.e. 70% of the data for training and 30% for testing.
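
If you just want a rough 70/30 split in plain SPL, one way to sketch it is to bucket events with eval's random() function (split_bucket is just a made-up field name):

| eval split_bucket = random() % 10 | where split_bucket < 7

That keeps roughly 70% of the events, and split_bucket > 6 would give the remaining 30%. The catch is that random() changes on every search run, so the two sides won't line up across separate searches.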

0 Karma

Re: Data to Train Machine Learning

New Member

Do you know the SPL command to split data into training and testing sets? I didn't see it in the docs. Thanks

0 Karma

Re: Data to Train Machine Learning

Contributor

It is not the clearest thing in the docs, but you use the sample command that comes with MLTK, specifically the partitions option (set to 10 is usually what you want), and then you search on partition_number < X. For a 70/30 split that would be less than 7, since partition numbering starts at 0. Make sure to use the seed option so you can come back later and search partition_number > X-1 to get the other side of the split.
Training set:
| sample partitions=10 seed=1234 | search partition_number < 7 | fit MLAlgoName targetfield from whateverfields into savedmodelname
Test set:
| sample partitions=10 seed=1234 | search partition_number > 6 | apply savedmodelname as predicted_field
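
On top of the test-set search above, a useful sanity check (plain SPL, nothing MLTK-specific) is to compare the actual target against the prediction per category:

| sample partitions=10 seed=1234 | search partition_number > 6 | apply savedmodelname as predicted_field | stats count by targetfield predicted_field

The rows where targetfield and predicted_field disagree show which categories the model is mixing up, which is often more informative than a single accuracy number.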