DSDL - Kmeans yields different results for fit and apply commands?

Gabriel
Explorer

Hello everyone

I am using the DSDL app: https://splunkbase.splunk.com/app/4607

The model I use is sklearn's kmeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

My goal is to cluster a dataset with kmeans and then assign new observations to the center points from my kmeans model.

I built the model and everything compiled just fine. I then wanted to check my model for possible logical errors. To this end, I use the same dataset in the | fit and | apply command:

| inputlookup wineventlog.csv
| fit MLTKContainer algo=Pipeline_V3 k=5 fe_* into app:pipeline_v3

| inputlookup wineventlog.csv
| apply pipeline_v3

To my understanding, this should yield the exact same results. However, it does not. | fit creates 5 clusters and assigns observations to each cluster. | apply on the other hand only assigns observations to 3 out of the 5 clusters. Does anyone have a precise idea what goes on behind the scenes with | fit and | apply? I already checked the documentation that outlines what they both do:

https://docs.splunk.com/Documentation/MLApp/5.3.3/User/Understandfitandapply

I thoroughly checked whether I have null values, non-numeric fields that might get converted, etc., but I could not figure out why fit and apply wouldn't yield the same result. Below is the code I use for each command.

from sklearn.cluster import KMeans

def fit(model, df, param):
    # Number of clusters
    k = int(param["options"]["params"]["k"])

    # Fit kmeans model
    kmeans = KMeans(n_clusters=k, random_state=0).fit(df)
    model["kmeans"] = kmeans

    return model

def apply(model, df, param):
    # Assign new observations to kmeans centers
    predictions = model["kmeans"].predict(df)

    return predictions
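One way to narrow this down outside Splunk is to call the two functions directly on the same DataFrame. The following is a minimal sketch, assuming synthetic data and a hand-built param dictionary; the column names fe_a and fe_b are hypothetical, and n_init=10 is set explicitly only to keep the behaviour the same across scikit-learn versions. If both functions receive identical input, predict should reproduce the labels from fit, which would suggest that any difference inside Splunk comes from what is passed in as df.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def fit(model, df, param):
    # Number of clusters, passed as a string just like a search option would be
    k = int(param["options"]["params"]["k"])
    model["kmeans"] = KMeans(n_clusters=k, random_state=0, n_init=10).fit(df)
    return model


def apply(model, df, param):
    # Assign each row to its nearest fitted cluster center
    return model["kmeans"].predict(df)


# Hypothetical stand-in for the lookup: five well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]])
points = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["fe_a", "fe_b"])

param = {"options": {"params": {"k": "5"}}}
model = fit({}, df, param)

labels_from_fit = model["kmeans"].labels_
labels_from_apply = apply(model, df, param)

# Compare the training labels with the labels predicted for the same rows
same_labels = bool(np.array_equal(labels_from_fit, labels_from_apply))
clusters_used = len(set(labels_from_apply))
print(same_labels, clusters_used)
```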



pdrieger_splunk
Splunk Employee

Hi @Gabriel,

Thanks for working with DSDL and sharing your findings here with the community. This behaviour is indeed not expected. Let me ask a few questions and share some thoughts to help resolve this issue.

1. Did you implement def load() and def save() in your notebook? If so, how did you serialise the KMeans model object (which would contain the trained state after running your fit)?

2. MLTK's fit and apply commands have some specific behaviour: fit is an eventing command, apply is a stateful streaming command. But if your model is properly loaded you should not see this deviation.

3. While I appreciate seeing you implement this in DSDL, you probably know that MLTK has KMeans out of the box, which might be easier to use to achieve the same goal. Would this be an alternative?

4. For DSDL you can also consider opening a support case for this issue. If your notebook and sample data are anonymised and contain no sensitive information, you could share them so we can reproduce your issue.
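For the first point, a pickle round trip can be checked in isolation. This is a minimal sketch, assuming a temporary directory rather than the container's model directory; the save and load signatures and the file name shown here are hypothetical and may not match the ones in the DSDL notebook template.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans


def save(model, path):
    # Serialise the whole model dictionary, including the fitted KMeans object
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load(path):
    # Restore the model dictionary written by save()
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical training data, used only to obtain a fitted model
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
model = {"kmeans": KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline_v3.pkl")
    save(model, path)
    restored = load(path)

# The restored model should carry the same centers and give the same predictions
centers_match = bool(
    np.array_equal(model["kmeans"].cluster_centers_, restored["kmeans"].cluster_centers_)
)
predictions_match = bool(
    np.array_equal(model["kmeans"].predict(X), restored["kmeans"].predict(X))
)
print(centers_match, predictions_match)
```

If both checks pass, serialisation is unlikely to be the cause, and the data handed to fit and apply is the next thing to compare.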

Hope this is helpful for you. Please let us know!

Gabriel
Explorer

Hi @pdrieger_splunk

First off, thanks for your reply 😃

1. Yes, I used Python's pickle library (import pickle). I appended pictures of the code.


2. After I load the model, it seems to be exactly the same as before saving it. Also check out the appended pictures.


3. Thanks for the heads up. I use DSDL for practicing purposes and kmeans due to it being relatively simple and unsupervised.

4. Am I right to assume reporting an issue here https://github.com/splunk/splunk-mltk-container-docker would be correct?

save_model.PNG

load_model.PNG


pdrieger_splunk
Splunk Employee

Thanks @Gabriel. I resolved your issue on https://github.com/splunk/splunk-mltk-container-docker/issues/30 by adding the correct feature selection with X = df[param['feature_variables']]. Please consider marking this answer as solved if this resolves your issue. Thanks and have a nice weekend 🙂

Gabriel
Explorer

Hi @pdrieger_splunk

I even tested that as well but must have done something wrong in the process. Thanks for your quick and detailed help, greatly appreciated! 🎉
