Re: Splunk Machine Learning App / Toolkit - Using DBSCAN Clustering Algorithm

hbrandt84 · ‎07-25-2017

Hi,

I want to use the Clustering Algorithm "DBSCAN" from the Machine Learning Toolkit.
(https://docs.splunk.com/Documentation/MLApp/2.3.0/User/Algorithms) --> listed under "clustering algorithms"

While implementing it, I noticed that this algorithm accepts only one parameter: EPS
(maximum distance between two samples for them to be considered in the same cluster)

Now if you look up any definition of the DBSCAN Algorithm, for example...
(https://en.wikipedia.org/wiki/DBSCAN)
...you will notice that DBSCAN needs two parameters to be functional:

EPS (Epsilon): maximum distance between two samples --> provided
minPts: minimum number of samples required to form a dense region --> missing

Does anybody know why the second parameter is missing?
I don't get how this algorithm can be functional...

nryabykh · ‎11-15-2017

You need to modify the $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos/DBSCAN.py file. In the __init__ function, replace the line

out_params = convert_params(options.get('params', {}), floats=['eps'])

with this one, which also parses min_samples (as an integer, since it is a count):

out_params = convert_params(options.get('params', {}), floats=['eps'], ints=['min_samples'])

After this you can write something like fit DBSCAN eps=0.1 min_samples=2 * in your SPL queries.
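
To illustrate why that one-line change matters, here is a minimal standalone sketch of what a parameter-conversion step like this might do. The function convert_params_sketch is hypothetical and is not the toolkit's actual convert_params; it assumes the real helper casts only the option names it was told about and rejects everything else, which would explain why min_samples is unavailable until it is listed. Casting min_samples to an integer is also a choice made for this sketch.

```python
def convert_params_sketch(params, floats=(), ints=()):
    """Hypothetical stand-in for the toolkit helper: cast string options by name."""
    out = {}
    for name, raw in params.items():
        if name in floats:
            out[name] = float(raw)
        elif name in ints:
            out[name] = int(raw)
        else:
            # Assumption: names that were not listed are rejected.
            raise ValueError("Unexpected parameter: %s" % name)
    return out


# Options as they might arrive from `fit DBSCAN eps=0.1 min_samples=2 *`:
# everything is still a string at this point.
spl_params = {'eps': '0.1', 'min_samples': '2'}

# With min_samples listed, both options are converted.
converted = convert_params_sketch(spl_params, floats=['eps'], ints=['min_samples'])
print(converted)

# With only eps listed, min_samples is rejected.
try:
    convert_params_sketch(spl_params, floats=['eps'])
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

The converted dictionary is what would then be handed to the underlying estimator as keyword arguments.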

niketn · ‎07-25-2017

@hbrandt84, I concur: scikit-learn also documents two parameters, min_samples and eps (http://scikit-learn.org/stable/modules/clustering.html#dbscan)

However, the algorithm description and the class details state that both parameters are optional:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Based on the following code for the DBSCAN algorithm, I would expect min_samples to default to 5 when it is not set (https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/dbscan_.py#L156):

def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None, n_jobs=1):

And:

def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
             algorithm='auto', leaf_size=30, p=None, n_jobs=1):
    self.eps = eps
    self.min_samples = min_samples
    self.metric = metric
    self.algorithm = algorithm
    self.leaf_size = leaf_size
    self.p = p
    self.n_jobs = n_jobs

However, this needs to be confirmed, and the Machine Learning Toolkit could be enhanced to expose min_samples as an input parameter for DBSCAN.
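
To show what a default of min_samples=5 would mean in practice, here is a minimal pure-Python sketch of the DBSCAN idea. It is not scikit-learn's implementation: dbscan_sketch and the sample data are made up for illustration, and only the default values (eps=0.5, min_samples=5) are taken from the signatures quoted above. Like scikit-learn, the sketch counts a point as part of its own neighborhood.

```python
from math import dist  # available in Python 3.8+

NOISE = -1


def dbscan_sketch(points, eps=0.5, min_samples=5):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Neighborhood of i: every point within eps of it, including i itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Two tight groups of three points each, plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
        (9.0, 0.0)]

labels_default = dbscan_sketch(data, eps=0.5)               # min_samples stays 5
labels_small = dbscan_sketch(data, eps=0.5, min_samples=2)
print(labels_default)  # [-1, -1, -1, -1, -1, -1, -1]
print(labels_small)    # [0, 0, 0, 1, 1, 1, -1]
```

So the algorithm still runs when only eps is supplied, but on small or sparse groups a fixed min_samples of 5 can mark everything as noise, which is why exposing the parameter is useful.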

