I would like to know what approaches to take for detecting patterns in outliers using Splunk. I'm familiar with approaches to detect outliers, but I would like Splunk to help identify what the outliers have in common, to speed up their investigation. For instance, are there values in any of the fields that are common between the outliers? Or do those values typically exceed a certain threshold?
Thanks!
Brian
@bschaap, you should try the Splunk Machine Learning Toolkit app, which needs the Python for Scientific Computing add-on to work.
The Machine Learning Toolkit provides several examples that detect and analyze numerical and categorical outliers, using both machine learning algorithms and standard outlier detection mechanisms. Refer to the following documentation and the showcase example on YouTube.
With your sample/test data you can experiment with thresholds, algorithms and several other critical parameters to make sure outliers are detected as expected. You can then capture the outlier SPL queries and apply them to your own use cases.
http://docs.splunk.com/Documentation/MLApp/latest/User/Showcaseexamples
https://www.youtube.com/watch?v=8POjmd9LYdY&index=5&list=PLxkFdMSHYh3Q1jwpgJJ0ZSnRzZIx2c_KM
Machine Learning Toolkit also provides several visualizations specifically for outlier detection and interpretation: http://docs.splunk.com/Documentation/MLApp/latest/User/Thebasicprocessofmachinelearning
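Once outliers are flagged, the "what do they have in common" part of the question can be approximated by comparing how often each field value appears among the outliers versus among all events. The sketch below does that in plain Python rather than SPL, only to illustrate the logic. The function name `common_values`, the `min_lift` cutoff and the sample events are all made up for illustration and are not part of Splunk or the Machine Learning Toolkit.

```python
from collections import Counter

def common_values(events, is_outlier, fields, min_lift=2.0):
    """Rank (field, value) pairs that are over-represented among outliers.

    Returns tuples of (field, value, share_of_outliers, lift), where lift is
    the share among outliers divided by the share among all events.
    """
    outliers = [e for e in events if is_outlier(e)]
    if not outliers:
        return []
    all_counts = Counter((f, e[f]) for e in events for f in fields if f in e)
    out_counts = Counter((f, e[f]) for e in outliers for f in fields if f in e)
    results = []
    for (field, value), n_out in out_counts.items():
        share_out = n_out / len(outliers)
        share_all = all_counts[(field, value)] / len(events)
        lift = share_out / share_all
        if lift >= min_lift:
            results.append((field, value, round(share_out, 2), round(lift, 2)))
    # Highest lift first; ties broken by field and value for stable output.
    return sorted(results, key=lambda r: (-r[3], r[0], str(r[1])))

# Hypothetical sample events; "bytes" is the numeric field used to flag outliers.
sample_events = [
    {"host": "web1", "status": "200", "bytes": 500},
    {"host": "web1", "status": "200", "bytes": 450},
    {"host": "web2", "status": "200", "bytes": 520},
    {"host": "web2", "status": "500", "bytes": 9000},
    {"host": "web3", "status": "500", "bytes": 9500},
    {"host": "web3", "status": "200", "bytes": 480},
]

shared = common_values(sample_events, lambda e: e["bytes"] > 5000, ["host", "status"])
print(shared)
```

In this made-up data both outliers have status 500, which only a third of all events have, so that value stands out, while the hosts do not pass the cutoff. Only categorical fields are passed in `fields`, since a numeric field would produce one bucket per distinct value.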
The Splunk ML app uses the predict command for all time series forecasting. The added benefit of using this app is the outlier visualization. A better approach would be to take time slices of events over several weeks and create a range of normal. Once you have this, you can then apply regressors from the ML app to your model.
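As a rough illustration of the "range of normal" idea, the sketch below groups several weeks of event counts by weekday and hour, then treats the mean plus or minus k standard deviations of each slot as normal. It is plain Python rather than SPL, and the names `build_baseline` and `is_outlier`, the hourly slot size and the value k=3 are assumptions chosen for the example, not something the ML app prescribes.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

def build_baseline(history, k=3.0):
    """Build {(weekday, hour): (lower, upper)} from (datetime, count) pairs."""
    slots = defaultdict(list)
    for ts, count in history:
        slots[(ts.weekday(), ts.hour)].append(count)
    baseline = {}
    for slot, counts in slots.items():
        if len(counts) < 2:
            continue  # stdev needs at least two samples per slot
        mu, sigma = mean(counts), stdev(counts)
        baseline[slot] = (mu - k * sigma, mu + k * sigma)
    return baseline

def is_outlier(baseline, ts, count):
    """True when count falls outside the normal range for that time slot."""
    bounds = baseline.get((ts.weekday(), ts.hour))
    if bounds is None:
        return False  # no history for this slot, so nothing to compare against
    lower, upper = bounds
    return count < lower or count > upper

# Hypothetical history: four consecutive Mondays, 09:00 slot.
history = [
    (datetime(2024, 1, 1, 9), 100),
    (datetime(2024, 1, 8, 9), 110),
    (datetime(2024, 1, 15, 9), 90),
    (datetime(2024, 1, 22, 9), 100),
]
baseline = build_baseline(history)
print(is_outlier(baseline, datetime(2024, 1, 29, 9), 105))
print(is_outlier(baseline, datetime(2024, 1, 29, 9), 200))
```

Because the baseline is computed once from history, checking a new value is a dictionary lookup rather than a fresh pass over all the data.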
To start, you can use the predict command to set upper and lower bounds that define what is "normal", and alert on anything outside those bounds. The limitation is that you can't train a model on your data, so a large search has to run each time the predict command runs.
A better approach would be to use relative_time with 15-minute spans, then clone and shift your data into their time slots, which will allow you to run fast searches over massive data sets without taxing your hardware.
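One way to read "clone and shift" is: bucket past events into 15-minute spans, then move a copy of those buckets forward by a whole number of weeks so they line up with the current week's slots and can be compared directly. The sketch below shows that reading in plain Python, with `timedelta` standing in for the time arithmetic that relative_time does in an eval. The helper names and sample timestamps are hypothetical, and this is an interpretation of the approach rather than the exact search the answer has in mind.

```python
from datetime import datetime, timedelta

SPAN = timedelta(minutes=15)

def floor_to_span(ts, span=SPAN):
    """Round a timestamp down to the start of its 15-minute bucket."""
    span_seconds = int(span.total_seconds())
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=(elapsed // span_seconds) * span_seconds)

def bucket_counts(timestamps):
    """Count events per 15-minute bucket."""
    counts = {}
    for ts in timestamps:
        key = floor_to_span(ts)
        counts[key] = counts.get(key, 0) + 1
    return counts

def shift_forward(buckets, weeks):
    """Clone past buckets and move them forward so they align with current slots."""
    return {ts + timedelta(weeks=weeks): count for ts, count in buckets.items()}

# Hypothetical events from last week (Monday 2024-01-01).
last_week = bucket_counts([
    datetime(2024, 1, 1, 9, 16),
    datetime(2024, 1, 1, 9, 29),
    datetime(2024, 1, 1, 9, 31),
])
aligned = shift_forward(last_week, weeks=1)
print(aligned)
```

After the shift, last week's 09:15 bucket carries this week's 09:15 timestamp, so the two weeks can be compared slot by slot without re-scanning raw events.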