Machine Learning K-Means Clustering Label Question

johannthum
Explorer

Hi all,

I have some questions about k-means clustering and would like some confirmation. I trained my model on sample data and it clustered the data into Cluster 1, Cluster 2, ..., Cluster k, which is fine. But when I use the same data (same time range and SPL) to "apply" the model I just trained, the cluster labels in the apply results don't seem to match those from the trained model, judging by the statistics.

To check the statistics, I ran | outputlookup to two different files (one from the fit command and one from the apply command), then ran ... | stats count by cluster on each. For example:
Outputlookup A (from fit command)
cluster count
1 8000
2 23
3 55

Outputlookup B (from apply command)
cluster count
1 55
2 8000
3 23

My question is whether this "random" cluster labeling from apply is expected, or whether apply should stick to the labels from the trained model. The latter makes more sense to me. It would be great if someone could confirm this!

Thank you


harshpatel
Contributor

This should not happen. I tried to reproduce it with the iris sample data set and it worked as expected. Under the hood it uses scikit-learn: when you call fit, it trains the model and stores it under the name you pass in the fit command so that it can be used later. When you run the apply command with the name of the saved model, it uses that same trained model, so given the same data points that were used for fitting, it should produce the same cluster labels.

KMeans example on iris dataset:

fitting the model:

| inputlookup iris.csv 
| fit KMeans k=3 petal* into test_kmeans 
| stats count by cluster

output:

cluster count
0           52
1           50
2           48

Applying model to same data:

| inputlookup iris.csv 
| apply test_kmeans 
| stats count by cluster

output:

cluster count
0           52
1           50
2           48
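
To check label consistency row by row rather than by comparing two separate counts, the fit and apply steps above can be chained in one search. This is only a sketch: it assumes the same iris.csv lookup and test_kmeans model name as above, fit_cluster is a field name made up here for illustration, and running it retrains and overwrites test_kmeans.

```
| inputlookup iris.csv
| fit KMeans k=3 petal* into test_kmeans
| rename cluster AS fit_cluster
| apply test_kmeans
| stats count by fit_cluster, cluster
```

If fit and apply agree, every row of the result should have fit_cluster equal to cluster. Rows where the two differ would suggest that apply is reading a different model than the one fit just saved.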

johannthum
Explorer

Thanks for taking the time to reproduce this! I'm expecting (and relying on) the labels being consistent. I will do more testing on my end and check whether adjusting k helps in this case.

johannthum
Explorer

You're right. It turned out that I was applying the wrong model. As I had not published the model yet, the model name was pretty cryptic, and I didn't realize that the model I was applying had a draft prefix. I fixed the model and the statistics look good now! Thanks again for verifying this!
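
For anyone who runs into the same mix-up, one way to catch it is to look at which saved models exist and what the one being applied contains. A minimal sketch, assuming the listmodels and summary commands are available in your app context, and reusing the test_kmeans name from the example above as a placeholder for your own model name:

```
| listmodels
```

```
| summary test_kmeans
```

The first search lists the saved models visible to you, which makes it easier to tell a draft and a published model with similar names apart. The second should show what the named model has learned (for KMeans, the cluster centers), which can be compared against what you expect from training.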

harshpatel
Contributor

No problem 🙂
