Solved: How to remove a cluster after applying DBSCAN?

rosho · ‎06-18-2019

Hi

I am working on a forecasting problem.
I want to use DBSCAN to detect outliers and then apply a Kalman filter to make forecasts.

But I do not know how to remove or transform the samples inside a cluster.
How can I connect these two "algorithms"?

#THIS IS TO APPLY DBSCAN
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins" 



#THIS IS TO FORECAST WITH KALMAN FILTER
| predict "logins" as prediction algorithm=LLP5 holdback=288 future_timespan=324 upper95=upper95 lower95=lower95 
| `forecastviz(324, 288, "logins", 95)`
| where prediction!="" AND 'logins' != ""
| `regressionstatistics("logins", prediction)`

Thank you

rosho · ‎07-02-2019

This is the SPL:

| fit DBSCAN eps=0.6 "SS_logins"
| where NOT cluster==-1
| predict "SS_logins" as prediction algorithm=LLP holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
| `forecastviz(324, 288, "SS_logins", 95)`

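The second line is how I remove the outliers: DBSCAN labels them cluster=-1, so the where clause keeps only the points that belong to a cluster.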

pdrieger_splunk · ‎06-19-2019

Hi rosho,

Let's assume the outliers detected by DBSCAN are marked with cluster=-1. You can then exclude them from the results of the first part of your search by filtering with | where cluster>-1, and run your forecasting part afterwards.

However, I would recommend having equidistant timestamps, e.g. by using a | timechart command before your forecasting part, because many forecasting algorithms expect evenly spaced input. You might also think of filling the gaps with imputed values, so that the forecasting model is trained on your "cleaned" data. You might find the Imputer useful here: https://docs.splunk.com/Documentation/MLApp/4.3.0/User/Algorithms#Imputer
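
A minimal sketch of that idea, assuming the rows in your lookup are already evenly spaced in time: instead of dropping the outlier rows, which would leave holes in the series, blank out the outlier values and let the Imputer fill them in. The mean strategy and the Imputed_logins output field name are assumptions on my side (I believe Imputed_ is the default prefix), so please check both against the Imputer documentation for your MLTK version.

```
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
| eval logins=if(cluster==-1, null(), logins)
| fit Imputer "logins" strategy=mean
| predict "Imputed_logins" as prediction algorithm=LLP5 holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
```

The eval keeps every row and only empties the logins value where cluster is -1, so the number of rows and their spacing stay unchanged for the forecast.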

Instead of | predict, I would also highly recommend having a look at the StateSpaceForecast algorithm, newly introduced in MLTK 4.2: https://docs.splunk.com/Documentation/MLApp/4.3.0/User/Algorithms#StateSpaceForecast

You might find this blog post useful; it explains the algorithm with an example: https://www.splunk.com/blog/2019/03/20/what-s-new-in-the-splunk-machine-learning-toolkit-4-2.html
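
As a rough sketch, the | predict line in your search could be swapped for something like the following. The parameter names (holdback, forecast_k, conf_interval) are how I remember them from the StateSpaceForecast documentation, and the values simply mirror your holdback=288, future_timespan=324 and 95% band, so treat all of them as assumptions and verify them against the documentation linked above.

```
| inputlookup fortigate_QC_May2019_logins.csv
| fit StateSpaceForecast "logins" holdback=288 forecast_k=324 conf_interval=95
```

The names of the output fields (prediction and confidence bounds) differ from those of | predict, so any visualization macro that follows would need to be adapted to them.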

Hope this helps.

rosho · ‎06-20-2019

Is "intervention detection" the same as the "Imputer"?

Intervention detection

I would replace packets of contiguous missing values with hourly averages around the missing values. If the values are not missing but are anomalous, either manually adjust them or estimate what they should have been via Intervention Detection, which is essentially a forward prediction/fitted value for an anomaly.
Outliers represent effects/variables that are omitted from your model and, if possible, need to be identified and accounted for by adding additional predictor series or, worst case, dummy indicators.
