Solved: Machine Learning Toolkit: Event Field Differences

bdbyerly · 12-19-2016

Hi Splunk,

I work for a corporate partner and am interested in the capabilities of your new Machine Learning Toolkit.
I wrote several Python scripts using the Splunk SDK for Python to do the following, but I would like to be able to do this directly in Splunk, via the Machine Learning Toolkit or a dashboard:

(1) I want to consider all pairwise event field differences for N events with M fields each. This yields N(N-1)/2, or roughly N^2/2, vectors (v_2,1, v_3,1, v_3,2, …), each of length M. Computing all N^2 vectors would be redundant, since v_j,k = v_k,j.

(2) From these event field differences, I perform binary classification using a linear weight vector w determined by some linear model such as LDA or logistic regression, i.e. if [v_j,k] * [w_1 … w_M].T < some threshold, then events j and k are similar.

(3) I then perform single-link clustering for all event fields deemed similar.
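
Steps (1)-(3) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the poster's actual scripts: the differences are taken as absolute values (so that v_j,k = v_k,j holds), the weights and threshold are made-up numbers rather than the output of a fitted LDA or logistic regression model, and the function names are hypothetical. Single-link clustering over a yes/no similarity relation is computed here as connected components with a union-find structure, since two events end up in the same cluster whenever a chain of similar pairs connects them.

```python
from itertools import combinations


def pairwise_differences(events):
    """Return {(j, k): |events[j] - events[k]|} for every pair j < k."""
    return {
        (j, k): [abs(a - b) for a, b in zip(events[j], events[k])]
        for j, k in combinations(range(len(events)), 2)
    }


def single_link_clusters(events, weights, threshold):
    """Label events so that any chain of similar pairs shares one cluster."""
    parent = list(range(len(events)))  # union-find forest, one node per event

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (j, k), diff in pairwise_differences(events).items():
        score = sum(d * w for d, w in zip(diff, weights))
        if score < threshold:  # pair judged similar: merge its two clusters
            parent[find(j)] = find(k)

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(events))]


# Illustrative data: N = 5 events with M = 2 numeric fields each.
events = [
    [1.0, 10.0],
    [1.2, 10.5],
    [1.4, 11.0],
    [9.0, 50.0],
    [9.1, 50.2],
]
labels = single_link_clusters(events, weights=[1.0, 0.1], threshold=0.3)
print(labels)
```

In this toy data, events 0 and 2 are not directly similar, but both are similar to event 1, so single-link chaining places all three in one cluster. The pairwise step is O(N^2 * M), which is the main cost to keep in mind for large N.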

As the ML Toolkit implements clustering, I suppose that adapting the existing source code would allow one to do this, but I would like to know if there is an easier way.

Blake

yangzd · 12-20-2016

Hi Blake,

This is an interesting algorithm!

I am proposing two solutions for you:

MLTK/SPL only:
Single-linkage clustering is not yet supported in the current release of MLTK, but MLTK does offer other clustering algorithms such as k-means, spectral clustering, DBSCAN, and Birch. If you don't mind trying one of the supported clustering algorithms, here is one possible solution:

(1) Transpose your raw data to flip events and fields.

(2) Calculate the differences for each pair of columns to generate the M * N^2/2 matrix (you may need SPL commands such as map, join, and/or foreach).

(3) Perform binary classification via the fit command.

(4) Since steps (1)-(3) can all be done using SPL and ML-SPL, copy and paste the SPL into the search bar of the Clustering dashboard in MLTK and try out the different supported clustering methods.

Via ML-SPL API:
If you already have a custom script that performs the transformation and clustering you described, you can wire it up with ML-SPL, which could be more convenient than the first method. SpectralClustering.py and DBSCAN.py in the Splunk_ML_Toolkit/bin/algos directory are useful reference scripts: follow the way fit_predict is implemented there and replace it with your own logic.
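
Whichever route is taken, the weight vector w for the binary classification step has to be learned from pairs already labelled as similar or different. Below is a minimal plain-Python sketch of that step using logistic regression fitted by batch gradient descent. It is not MLTK code and does not use the ML-SPL API: the function names, learning rate, epoch count, and training data are all assumptions made for illustration. Pairs are labelled 1 for "different" and 0 for "similar", so the learned bias b gives the threshold directly: a pair counts as similar when w . v < -b.

```python
import math


def fit_logistic(diffs, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss.

    labels: 1 = the pair is different, 0 = the pair is similar.
    """
    m = len(diffs[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for x, y in zip(diffs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        # Step against the mean gradient over all training pairs.
        w = [wi - lr * g / len(diffs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(diffs)
    return w, b


def is_similar(diff, w, b):
    """A pair is similar when w . diff < -b, i.e. the model favours label 0."""
    return sum(wi * xi for wi, xi in zip(w, diff)) < -b


# Made-up absolute field differences for six labelled pairs (M = 2 fields).
diffs = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [2.0, 1.5], [1.8, 2.2], [2.5, 2.0]]
pair_labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(diffs, pair_labels)
print([is_similar(d, w, b) for d in diffs])
```

In practice a library implementation (or the fit command mentioned above) would replace the hand-written loop; the sketch is only meant to show how the fitted bias maps onto the threshold in the original formulation.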

Hope it's useful.
