I am forecasting the number of logins. I have a dataset with the number of logins for each hour.
First, I use LOF (local outlier factor) to find outliers, and then I remove them.
Second, I use a Kalman filter to forecast.
But as you can see in the plot, the prediction (blue) is shifted in time relative to the actual logins (red) and to the future confidence interval (green).
So, would it be better to transform the outliers (make them look "normal") rather than removing them completely? Right now removal leaves gaps in the time series. For example, on the first Monday I have logins for each hour from 00h to 15h and then from 18h onward, so there is a gap of 2 hours.
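One way to "make outliers normal" instead of deleting them is to replace each outlier with a value interpolated from its nearest non-outlier neighbours, which keeps the hourly series continuous. A minimal sketch, assuming `counts` is the hourly login series and `outlier_idx` comes from whatever detector you use (LOF or otherwise); both names are illustrative:

```python
def interpolate_outliers(counts, outlier_idx):
    """Replace outlier hours by linear interpolation between the
    nearest non-outlier neighbours, instead of deleting them."""
    bad = set(outlier_idx)
    result = list(counts)
    for i in sorted(bad):
        # walk outward to the nearest "normal" neighbour on each side
        lo = i - 1
        while lo in bad:
            lo -= 1
        hi = i + 1
        while hi in bad:
            hi += 1
        if lo < 0 or hi >= len(counts):
            continue  # outlier at the series edge: leave it as-is
        frac = (i - lo) / (hi - lo)
        result[i] = counts[lo] + frac * (counts[hi] - counts[lo])
    return result

hourly = [120, 130, 125, 900, 135, 140]   # 900 is a spike
print(interpolate_outliers(hourly, [3]))  # -> [120, 130, 125, 130.0, 135, 140]
```

This keeps every hour populated, so the forecaster never sees a gap; whether interpolation or removal gives better forecasts is something you'd still have to test on your data.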
@rosho you would need to test several outputs with your data to confirm whether to include outliers or not. See the Splunk blog post "Ensuring Success with Splunk ITSI Adaptive Thresholding, Part 3", which notes that the quantile policy is fairly resistant to very large outliers.
For the predict command, try running it over older days with holdback to check whether the predicted and actual values line up better.
That's a good question... as with anything in ML, you have to test and pick what suits your use case best.
Including outliers might distort the "normal" data that your predictions should be based on, but some outliers recur, and keeping those can help avoid false positives.
What happens if you include the outliers? Are the results affected heavily?