Hey guys,
Hope you can give some insight on this.
Currently, we are using Machine Learning (ML) to predict the type/category of new tickets. We are using the whole index as input to train the model.
My question: is it right to use the whole index as input? Or does ML only need the new set of data/events to train the model?

As a general case, NO. You should reserve at least half the data for testing the results of the training. Otherwise, how will you validate that the result is reasonable?
Second, "predicting new ticket type/category" is pretty vague. What is the research question? What are you looking to achieve by having this new category? What kind of tickets are we talking about - airplane tickets, trouble tickets, concert tickets, sports tickets?

Thanks for the response @DalJeanis!
By ticket, I mean incidents logged by users. We are using ML to auto-categorize each logged incident as an 'Admin Request', 'Change Request', etc.
As for the data: if a new set of data is ingested into Splunk and has also been auto-categorized, is it advisable to use that as training data for ML?
Do you know which SPL command splits the data into training and test sets? I didn't see it in the docs. Thanks.

It is not the clearest thing in the docs, but you use the sample command that comes with MLTK with its partitions option (10 is usually what you want) and then filter on partition_number < X. For a 70/30 split, X is 7, because partition numbers start at 0, so partitions 0 through 6 form the training set. Make sure to set the seed option so that the partition assignment is repeatable and you can come back later and filter on partition_number > X-1 (here, partitions 7 through 9) to get the other side of the split.
Training set:
Test set:
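The two searches that followed these labels did not survive in the thread. As a rough sketch of what such a pair could look like: this assumes the MLTK sample command writes to its default partition_number field, and it uses a hypothetical index name (ticket_index) and an arbitrary seed value (42), neither of which comes from the original post.

Training set (partitions 0 through 6, roughly 70% of events):

```
index=ticket_index
| sample partitions=10 seed=42
| where partition_number < 7
```

Test set (partitions 7 through 9, roughly 30% of events):

```
index=ticket_index
| sample partitions=10 seed=42
| where partition_number > 6
```

Both searches need the same seed and the same base search and time range. Otherwise the partition assignment may differ between runs, and the two sets could overlap.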
A common choice is a 70/30 split between training and test: 70% of the data is used for training and 30% for testing.
