## Machine Learning Toolkit: Event Field Differences


Hi Splunk,

I work for a corporate partner and am interested in the capabilities of your new Machine Learning Toolkit.
I wrote several Python scripts using the Splunk SDK for Python to do the following, but I would like to do this directly from Splunk via the Machine Learning Toolkit or a dashboard:

(1) I want to compute all pairwise event field differences for N events of M fields each. This yields ~N^2/2 difference vectors (v_2,1, v_3,1, v_3,2, …), each of length M (computing all N^2 vectors would be redundant, since v_j,k = v_k,j).

(2) From these event field differences, I perform binary classification using a linear weight vector w determined via some linear model such as LDA or logistic regression, i.e. if [v_j,k] * [w_1 … w_M]^T < some threshold, then events j and k are similar.

(3) I then perform single-link clustering over all event pairs deemed similar.
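For readers wanting to see the three steps end to end, here is a minimal standalone sketch in plain Python. All data, field names, weights, and the threshold are hypothetical, and the sketch assumes symmetric (absolute) field differences, consistent with v_j,k = v_k,j above. Single-link clustering at a fixed threshold is implemented as connected components over the similarity graph via union-find:

```python
import itertools

# Toy events: N events, each with M numeric fields (hypothetical data).
events = [
    {"bytes": 100.0, "duration": 2.0},
    {"bytes": 105.0, "duration": 2.1},
    {"bytes": 900.0, "duration": 9.0},
]
fields = ["bytes", "duration"]

# Hypothetical weights and threshold from a previously fitted linear model.
w = {"bytes": 0.01, "duration": 1.0}
threshold = 2.0

# (1) Pairwise difference vectors v_j,k for j > k: ~N^2/2 vectors of length M.
# (2) Linear score w . v_j,k; below the threshold => events j and k are similar.
similar_pairs = []
for k, j in itertools.combinations(range(len(events)), 2):
    v = {f: abs(events[j][f] - events[k][f]) for f in fields}
    score = sum(w[f] * v[f] for f in fields)
    if score < threshold:
        similar_pairs.append((k, j))

# (3) Single-link clustering at this threshold = connected components of the
# similarity graph, computed with a small union-find.
parent = list(range(len(events)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for a, b in similar_pairs:
    parent[find(a)] = find(b)

clusters = {}
for i in range(len(events)):
    clusters.setdefault(find(i), []).append(i)
print(sorted(clusters.values()))  # -> [[0, 1], [2]]
```

Events 0 and 1 differ only slightly, so their score (0.15) falls under the threshold and they merge into one cluster; event 2 stays on its own.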

As the ML toolkit implements clustering, I suppose that adaptations to the existing source code would allow one to do this, but I would like to know if there is an easier way.

Blake

Splunk Employee

Hi Blake,

This is an interesting algorithm!

I am proposing two solutions for you:

1. MLTK/SPL only:
Single-linkage clustering is not yet a supported algorithm in the current release of MLTK. But MLTK offers some other clustering algorithms such as kmeans, spectral clustering, DBSCAN and Birch. So if you don't mind trying one of the supported clustering algorithms, then here is one possible solution: (1) transpose your raw data to flip events/fields, (2) calculate differences per pair of columns and generate the M * N^2/2 matrix (you may need SPL commands such as map, join, and/or foreach), (3) perform binary classification via fit command, (4) since step (1)-(3) can be done using SPL and ML-SPL, you can copy/paste the SPL into the search bar in the Clustering dashboard of MLTK and try out different supported clustering methods.

2. Via ML-SPL API:
If you already have a custom script that performs the transformation and clustering you described, you can wire it up with ML-SPL, which may be more convenient than the first method. Good reference scripts are SpectralClustering.py and DBSCAN.py in the SplunkMLToolkit/bin/algos directory; follow how fit_predict is implemented there and swap in your own logic.
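To make the fit_predict pattern concrete, here is a standalone sketch of what such a script's core could look like. The real scripts in SplunkMLToolkit/bin/algos subclass MLTK base classes and receive a pandas DataFrame from the fit command; the class name, option names, and list-of-dicts input below are all illustrative assumptions, not the actual ML-SPL API:

```python
class PairwiseSingleLink:
    """Sketch: cluster events by thresholded pairwise field differences.

    Stands in for the fit_predict method you would implement when adapting
    SpectralClustering.py or DBSCAN.py; weights/threshold would come from a
    previously fitted linear model (e.g. LDA or logistic regression).
    """

    def __init__(self, weights, threshold):
        self.weights = weights      # hypothetical per-field weights
        self.threshold = threshold  # hypothetical similarity cutoff

    def fit_predict(self, records):
        """Return one cluster label per record (single-link components)."""
        n = len(records)
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        # Score every pair; merge components when the pair is "similar".
        for k in range(n):
            for j in range(k + 1, n):
                score = sum(
                    w * abs(records[j][f] - records[k][f])
                    for f, w in self.weights.items()
                )
                if score < self.threshold:
                    parent[find(j)] = find(k)

        # Relabel component roots as consecutive cluster ids.
        labels, ids = [], {}
        for i in range(n):
            labels.append(ids.setdefault(find(i), len(ids)))
        return labels

model = PairwiseSingleLink({"bytes": 0.01, "duration": 1.0}, threshold=2.0)
records = [
    {"bytes": 100.0, "duration": 2.0},
    {"bytes": 105.0, "duration": 2.1},
    {"bytes": 900.0, "duration": 9.0},
]
print(model.fit_predict(records))  # -> [0, 0, 1]
```

In an actual MLTK adaptation, fit_predict would instead read field columns off the incoming DataFrame and write the labels back as a new cluster field.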

Hope it's useful.
