Hi
I am working on a forecasting problem.
I want to use DBSCAN to detect outliers and then apply a Kalman filter to make forecasts.
But I do not know how to remove or transform the samples that DBSCAN flags as outliers.
How can I connect these two "algorithms"?
#THIS IS TO APPLY DBSCAN
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
#THIS IS TO FORECAST WITH KALMAN FILTER
| predict "logins" as prediction algorithm=LLP5 holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
| `forecastviz(324, 288, "logins", 95)`
| where prediction!="" AND 'logins' != ""
| `regressionstatistics("logins", prediction)`
Thank you
This is the SPL:
|fit DBSCAN eps=0.6 "SS_logins"
|where NOT cluster==-1
| predict "SS_logins" as prediction algorithm=LLP holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
| `forecastviz(324, 288, "SS_logins", 95)`
The second line is how I remove the outliers (the points with cluster=-1).
Hi rosho,
Assuming the outliers detected by DBSCAN are marked with cluster=-1, you can exclude them from the results of the first part of your search by filtering with | where cluster>-1. You can then run the forecasting part.
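Put together with the search from the question, that could look roughly like the sketch below. It only reuses commands and parameters already posted in this thread, so eps=0.2 and the predict settings are the question's values, not tuned recommendations.

```spl
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
| where cluster>-1
| predict "logins" as prediction algorithm=LLP5 holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
```

Note that the where filter drops whole rows, so the series passed to predict is no longer evenly spaced wherever an outlier was removed.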
However, I would recommend having equidistant timestamps, e.g. by using a | timechart command before the forecasting part, because many forecasting algorithms expect regularly spaced input. You might also consider filling the resulting gaps with imputed values so that the forecasting model is trained on your "cleaned" data. You might find the Imputer useful here: https://docs.splunk.com/Documentation/MLApp/4.3.0/User/Algorithms#Imputer
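A minimal sketch of the equidistant variant, under a few assumptions that are not in the original post: the lookup has a _time field, the data is sampled every 5 minutes (a guess based on holdback=288 being one day of 5-minute points), and logins_clean is a made-up field name. Rather than dropping the outlier rows, it blanks their value so that every time bucket stays in place:

```spl
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
| eval logins_clean=if(cluster==-1, null(), logins)
| timechart span=5m avg(logins_clean) as logins_clean
| predict "logins_clean" as prediction algorithm=LLP5 holdback=288 future_timespan=324 upper95=upper95 lower95=lower95
```

The predict command is documented as being able to fill in missing data in a time series, so the empty buckets should not need to be removed first, but it is worth verifying that on your own data.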
Instead of | predict, I would also highly recommend having a look at the StateSpaceForecast algorithm, newly introduced in MLTK 4.2: https://docs.splunk.com/Documentation/MLApp/4.3.0/User/Algorithms#StateSpaceForecast
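As a rough sketch only, swapping predict for StateSpaceForecast might look like the search below. The parameter names holdback, forecast_k and conf_interval are written from memory of the MLTK documentation linked above, and mapping future_timespan=324 onto forecast_k=324 is an assumption, so please check both against the docs for your MLTK version before relying on it.

```spl
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
| where cluster>-1
| fit StateSpaceForecast logins holdback=288 forecast_k=324 conf_interval=95
```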
This blog post explains it with an example: https://www.splunk.com/blog/2019/03/20/what-s-new-in-the-splunk-machine-learning-toolkit-4-2.html
Hope this helps.
Is "intervention detection" the same as the "Imputer"?
I would replace blocks of contiguous missing values with hourly averages around the missing values. If the values are not missing but are anomalous, either adjust them manually or estimate what they should have been via Intervention Detection, which is essentially a forward prediction/fitted value for an anomaly.
Outliers represent effects/variables that are omitted from your model and, if possible, need to be identified and accounted for by adding additional predictor series or, in the worst case, dummy indicators.
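For the hourly-average idea, a simple stand-in (this is not Intervention Detection, just a plain average fill) could be sketched in SPL as follows. It assumes the lookup has a _time field; logins_clean, hour_bucket, hourly_avg and logins_filled are hypothetical field names. Outliers are treated as missing and then replaced by the mean of the remaining values in the same clock hour:

```spl
| inputlookup fortigate_QC_May2019_logins.csv
| fit StandardScaler "logins" with_mean=false with_std=true
| fit DBSCAN eps=0.2 "SS_logins"
| eval logins_clean=if(cluster==-1, null(), logins)
| bin span=1h _time as hour_bucket
| eventstats avg(logins_clean) as hourly_avg by hour_bucket
| eval logins_filled=coalesce(logins_clean, hourly_avg)
```

If an entire hour is missing or flagged, hourly_avg is empty for that hour and logins_filled stays empty too, so longer gaps would need a wider window or a different method.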