Hi all,
I have a question about k-means clustering and would appreciate some confirmation. I trained my model on sample data and it grouped the data into clusters (Cluster 1, Cluster 2, ..., Cluster k), as expected. However, when I used the same data (same time range and same SPL) to "apply" the model I had just trained, the cluster labels returned by "apply" did not seem to match those from the trained model, judging by the statistics.
To check the statistics, I ran | outputlookup to two different files (one from the fit command and one from the apply command), and then ran ... | stats count by cluster on each. For example:
Outputlookup A (from fit command)
cluster count
1 8000
2 23
3 55
Outputlookup B (from apply command)
cluster count
1 55
2 8000
3 23
My question is whether this seemingly "random" cluster labeling from apply is expected, or whether apply should stick to the labels from the trained model. The latter makes more sense to me. It would be great if someone could confirm this!
Thank you
This should not happen. I tried to reproduce it with the iris sample data set and it worked as expected. At the code level it uses scikit-learn: when you call fit, it trains the model and stores it under the name you pass in the fit command so that it can be used later. When you run the apply command with the name of the saved model, it uses that same trained model, so if it is given the same data points that were used for fitting, it should produce the same cluster labels.
KMeans example on iris dataset:
Fitting the model:
| inputlookup iris.csv
| fit KMeans k=3 petal* into test_kmeans
| stats count by cluster
output:
cluster count
0 52
1 50
2 48
Applying model to same data:
| inputlookup iris.csv
| apply test_kmeans
| stats count by cluster
output:
cluster count
0 52
1 50
2 48
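To illustrate the fit / store / apply idea outside Splunk, here is a minimal Python sketch. It is not the MLTK or scikit-learn implementation: fit_kmeans, assign, and the sample points are hypothetical names and data made up for this example, and pickle only stands in for whatever storage the fit command really uses. It shows that labels come from the stored centroids, so the same points get the same labels, while a separately trained model can number the same groups differently.

```python
import pickle
from math import dist


def assign(centroids, point):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(centroids[i], point))


def fit_kmeans(points, k, iterations=20):
    """Toy k-means with a deterministic start: the first k points are the
    initial centroids (an assumption for this sketch, not how MLTK does it)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(centroids, p)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


# Made-up sample data: three well-separated groups of two points each.
points = [(1.0, 1.1), (5.0, 5.2), (9.0, 0.5), (0.9, 1.0), (5.1, 4.9), (9.2, 0.4)]

# "fit": train, label the training data, and store the model.
centroids = fit_kmeans(points, k=3)
fit_labels = [assign(centroids, p) for p in points]
saved_model = pickle.dumps(centroids)

# "apply": load the stored model and label the same data again.
restored = pickle.loads(saved_model)
apply_labels = [assign(restored, p) for p in points]
print(fit_labels == apply_labels)  # same model + same points -> same labels

# A different model trained on the same points in another order finds the
# same groups, but the label numbers are permuted.
other_centroids = fit_kmeans(list(reversed(points)), k=3)
other_labels = [assign(other_centroids, p) for p in points]
print(fit_labels, other_labels)
```

If the counts per cluster look shuffled between fit and apply, as in the question, comparing the labels row by row like this can help tell a different model apart from a genuine mismatch.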
Thanks for the effort to reproduce this! I expect (and am relying on) the labels being consistent. I will do more testing on my end and check whether adjusting k helps in this case.
You're right. It turned out that I was applying the wrong model. As I had not published the model yet, the model name was pretty cryptic, and I didn't realize that the model I was applying had a draft prefix on it. I fixed the model and the statistics look good now! Thanks again for verifying this!
No problem 🙂