Solved: Machine Learning K-Means Clustering Label Question

johannthum · ‎04-10-2019

Hi all,

I have a question about k-means clustering and would like some confirmation. I trained my model on sample data and it grouped the data into clusters (Cluster 1, Cluster 2, ..., Cluster k), which is fine. But when I used the same data set (same time range and SPL) to "apply" the model I had just trained, the cluster labels returned by "apply" didn't seem to match those of the trained model, judging by the statistics.

To check the statistics, I ran | outputlookup to two different files (one from the fit command and one from the apply command) and then ran ... | stats count by cluster on each. For example:
Outputlookup A (from fit command)
cluster count
1 8000
2 23
3 55

Outputlookup B (from apply command)
cluster count
1 55
2 8000
3 23

My question is whether this seemingly "random" cluster labeling from apply is expected, or whether apply should stick to the same labels as the trained model. The latter makes more sense to me. It would be great if someone could confirm this!

Thank you

harshpatel · ‎04-10-2019

This should not happen. I tried to reproduce it with the iris sample data set and it worked as expected. Under the hood it uses scikit-learn: when you call fit, it trains the model and stores it under the name you pass in the fit command so that it can be used later. When you run the apply command with the name of the saved model, it uses that same trained model, and given the same data points that were used for fitting, it should produce the same cluster labels.

KMeans example on iris dataset:

fitting the model:

| inputlookup iris.csv 
| fit KMeans k=3 petal* into test_kmeans 
| stats count by cluster

output:

cluster count
0           52
1           50
2           48

Applying model to same data:

| inputlookup iris.csv 
| apply test_kmeans 
| stats count by cluster

output:

cluster count
0           52
1           50
2           48
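
To illustrate why one stored model gives stable labels while a different model may not, here is a minimal Python sketch. It is not MLTK's or scikit-learn's actual code. It assumes a fitted k-means model boils down to an ordered list of centroids and that a point's cluster label is the index of its nearest centroid. The `predict` helper, the toy points, and the centroid values are all hypothetical.

```python
from collections import Counter

def predict(centroids, points):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# Toy data: three points near the origin, two near (5, 5), one at (9, 0).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]

# Two hypothetical stored models with the same centroids in a different order.
model_a = [(0.1, 0.1), (5.1, 5.0), (9.0, 0.0)]
model_b = [(5.1, 5.0), (9.0, 0.0), (0.1, 0.1)]

labels_a = predict(model_a, points)        # [0, 0, 0, 1, 1, 2]
labels_a_again = predict(model_a, points)  # same model, same data, same labels
labels_b = predict(model_b, points)        # [2, 2, 2, 0, 0, 1]

print(Counter(labels_a))  # counts per cluster label for model_a
print(Counter(labels_b))  # same group sizes, attached to different labels
```

Reusing `model_a` reproduces the labels exactly, while `model_b` finds the same groups but numbers them differently, so a count by cluster shows the same sizes against different labels.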

johannthum · ‎04-10-2019

Thanks for taking the time to reproduce this! I'm expecting (and relying on) the labels to be consistent. I will do more testing on my end and check whether adjusting k helps in this case.

johannthum · ‎04-10-2019

You're right. It turned out that I was applying the wrong model. As I had not published the model yet, the model name was pretty cryptic, and I didn't realize that the model I was applying had a draft prefix. I fixed the model and the statistics look good now! Thanks again for verifying this!
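
For anyone who sees shifted counts like the ones in the original question, a per-event comparison tells more than counts alone. Below is a minimal Python sketch, assuming the cluster labels from fit and from apply have been exported for the same events in the same order. `label_mapping` is a hypothetical helper, not part of Splunk or MLTK.

```python
def label_mapping(fit_labels, apply_labels):
    """Return {fit_label: apply_label} if both labelings describe the same
    grouping up to renaming of the labels, otherwise None."""
    if len(fit_labels) != len(apply_labels):
        raise ValueError("label lists must have the same length")
    forward, backward = {}, {}
    for f, a in zip(fit_labels, apply_labels):
        # Each fit label must pair with exactly one apply label and vice versa.
        if forward.setdefault(f, a) != a or backward.setdefault(a, f) != f:
            return None
    return forward

# Pure relabeling: every event keeps its group, only the numbers change.
print(label_mapping([1, 1, 2, 3], [2, 2, 3, 1]))  # {1: 2, 2: 3, 3: 1}

# Different grouping: the second event moved to another cluster.
print(label_mapping([1, 1, 2], [1, 2, 2]))        # None
```

If a mapping comes back, both runs grouped the events identically and only the numbering differs, which would be consistent with two separately trained models rather than one stored model. `None` means the groupings themselves differ.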

harshpatel · ‎04-10-2019

No problem 🙂
