All Apps and Add-ons

Splunk Machine Learning App / Toolkit - Using DBSCAN Clustering Algorithm

hbrandt84
Path Finder

Hi,

I want to use the Clustering Algorithm "DBSCAN" from the Machine Learning Toolkit.
(https://docs.splunk.com/Documentation/MLApp/2.3.0/User/Algorithms) --> listed under "clustering algorithms"

Now, upon implementation, I noticed, that this algorithm only needs one parameter: EPS
(maximum distance between two samples for them to be considered in the same cluster)

Now if you look up any definition of the DBSCAN Algorithm, for example...
(https://en.wikipedia.org/wiki/DBSCAN)
...you will notice that a DBSCAN algorithm will need 2 Parameters to be functional:

  • EPS (Epsilon): maximum distance between two samples --> provided
  • minPTS: minimum occurences of samples within a cluster --> missing

Does anybody know, why the second Parameter ist missing?
I Don't get how this algorithm can be functional....

nryabykh
Path Finder

You need to modify $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos/DBSCAN.py file. In __init__ function replace string

out_params = convert_params(options.get('params', {}), floats=['eps'])

with this one:

out_params = convert_params(options.get('params', {}), floats=['eps', 'min_samples'])

After this you can write something like fit DBSCAN eps=0.1 min_samples=2 * in your SPL queries.

0 Karma

niketn
Legend

@hbrandt84, I concur, scikit learn also mentions two parameters i.e. min_samples and eps (http://scikit-learn.org/stable/modules/clustering.html#dbscan)

However, algorithm description and class detail mention that these parameters are optional:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Based on the following code for DBSCAN algorithm, I would expect that initialization default value is min_samples=5 (https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/dbscan_.py#L156):

def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None, n_jobs=1):

And:

def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
             algorithm='auto', leaf_size=30, p=None, n_jobs=1):
    self.eps = eps
    self.min_samples = min_samples
    self.metric = metric
    self.algorithm = algorithm
    self.leaf_size = leaf_size
    self.p = p
    self.n_jobs = n_jobs

However, this needs to be confirmed and possibly enhanced in Machine Learning Toolkit to create a min_samples input parameter for DBSCAN.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"
0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

[Puzzles] Solve, Learn, Repeat: Matching cron expressions

This puzzle (first published here) is based on matching timestamps to cron expressions.All the timestamps ...

Why Splunk Customers Should Attend Cisco Live 2026 Las Vegas

Why Splunk Customers Should Attend Cisco Live 2026 Las Vegas     Cisco Live 2026 is almost here, and this ...

Data Management Digest – May 2026

Welcome to the May 2026 edition of Data Management Digest!   As your trusted partner in data innovation, the ...