Splunk Search

## Outlier Dip Trough Detection

Path Finder

I am working on time series data and would like to detect these type of trough's in the graphs.   The y axis is network bandwidth and minimum value is 0.

I'm applying the base query time series to a DensityProbability model then with the following SPL for the Outlier chart:

``````| eval leftRange=mvindex(BoundaryRanges,0)
| eval rightRange=mvindex(BoundaryRanges,1)
| rex field=leftRange "Infinity:(?<lowerBound>[^:]*):"
| rex field=rightRange "(?<upperBound>[^:]*):Infinity"
| fields _time, 1/1/g1, lowerBound, upperBound, "IsOutlier(1/1/g1)", *``````

What approach can I take to detect the significant dip in the graph?

Labels (2)

• ### timechart

1 Solution
Builder

Are you fitting your model using stable data without outliers?

Here's an example you can recreate without data:

First, let's a define two macros to generate a bit of Gaussian noise:

``````# macros.conf

[norminv(3)]
args = p,u,s
definition = "exact(\$u\$ + \$s\$ * if(\$p\$ < 0.5, -1 * (sqrt(-2.0 * ln(\$p\$)) - ((0.010328 * sqrt(-2.0 * ln(\$p\$)) + 0.802853) * sqrt(-2.0 * ln(\$p\$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln(\$p\$)) + 0.189269) * sqrt(-2.0 * ln(\$p\$)) + 1.432788) * sqrt(-2.0 * ln(\$p\$)) + 1.0)), (sqrt(-2.0 * ln(1 - \$p\$)) - ((0.010328 * sqrt(-2.0 * ln(1 - \$p\$)) + 0.802853) * sqrt(-2.0 * ln(1 - \$p\$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln(1 - \$p\$)) + 0.189269) * sqrt(-2.0 * ln(1 - \$p\$)) + 1.432788) * sqrt(-2.0 * ln(1 - \$p\$)) + 1.0))))"
iseval = 1

[rand]
definition = "random()/2147483647"
iseval = 1``````

norminv(3) is similar to the Excel, Matlab, et al. norminv function and returns the inverse of the normal cumulative distribution function with a probability of p, a mean of u, and standard deviation of sp must be greater than 0 and less than 1. The estimator is taken from Abramowitz and Stegun. More precise estimators can be taken from e.g. Odeh and Evans, but this is fine for toys like this.

rand() generates a random number between 0 and 1 using the known range of Splunk's random() function.

Next, let's generate some training data and fit it to a model:

``````| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| fit DensityFunction x into simple_gaussian_model``````

We can review the model parameters with the summary command:

``| summary simple_gaussian_model``

 type min max mean std cardinality distance other Auto: Gaussian KDE 0.106517 0.386431 0.251574 0.042535 1440 metric: wasserstein, distance: 0.0010866394106055493 bandwidth: 0.009932847538368504, parameter size: 1440

Very close to our original mean of 0.25 and standard deviation of 0.05!

Finally, let's generate some test data and apply the model:

``````| gentimes start=05/01/2021:00:00:00 end=05/01/2021:09:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| append
[| gentimes start=05/01/2021:09:00:00 end=05/01/2021:11:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.01, 0.005)`]
| append
[| gentimes start=05/01/2021:11:00:00 end=05/01/2021:12:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`]
| apply simple_gaussian_model
| rex field=BoundaryRanges "-Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
| table _time x lcl ucl``````

Very nice!

You can find the outliers directly--as you would in an alert search, for example--with a simple where command:

``````| apply simple_gaussian_model
| where 'IsOutlier(x)'==1.0
| table _time x``````

 _time x 2021-05-01 00:41:00 0.379067377 2021-05-01 02:54:00 0.411517318 2021-05-01 03:01:00 0.100776418 2021-05-01 07:18:00 0.131441104 2021-05-01 08:43:00 0.119352555 2021-05-01 08:49:00 0.379070878 2021-05-01 09:00:00 0.017377844 2021-05-01 09:01:00 0.013617436 2021-05-01 09:02:00 0.009148409 . . . . . .
Builder

Are you fitting your model using stable data without outliers?

Here's an example you can recreate without data:

First, let's a define two macros to generate a bit of Gaussian noise:

``````# macros.conf

[norminv(3)]
args = p,u,s
definition = "exact(\$u\$ + \$s\$ * if(\$p\$ < 0.5, -1 * (sqrt(-2.0 * ln(\$p\$)) - ((0.010328 * sqrt(-2.0 * ln(\$p\$)) + 0.802853) * sqrt(-2.0 * ln(\$p\$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln(\$p\$)) + 0.189269) * sqrt(-2.0 * ln(\$p\$)) + 1.432788) * sqrt(-2.0 * ln(\$p\$)) + 1.0)), (sqrt(-2.0 * ln(1 - \$p\$)) - ((0.010328 * sqrt(-2.0 * ln(1 - \$p\$)) + 0.802853) * sqrt(-2.0 * ln(1 - \$p\$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln(1 - \$p\$)) + 0.189269) * sqrt(-2.0 * ln(1 - \$p\$)) + 1.432788) * sqrt(-2.0 * ln(1 - \$p\$)) + 1.0))))"
iseval = 1

[rand]
definition = "random()/2147483647"
iseval = 1``````

norminv(3) is similar to the Excel, Matlab, et al. norminv function and returns the inverse of the normal cumulative distribution function with a probability of p, a mean of u, and standard deviation of sp must be greater than 0 and less than 1. The estimator is taken from Abramowitz and Stegun. More precise estimators can be taken from e.g. Odeh and Evans, but this is fine for toys like this.

rand() generates a random number between 0 and 1 using the known range of Splunk's random() function.

Next, let's generate some training data and fit it to a model:

``````| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| fit DensityFunction x into simple_gaussian_model``````

We can review the model parameters with the summary command:

``| summary simple_gaussian_model``

 type min max mean std cardinality distance other Auto: Gaussian KDE 0.106517 0.386431 0.251574 0.042535 1440 metric: wasserstein, distance: 0.0010866394106055493 bandwidth: 0.009932847538368504, parameter size: 1440

Very close to our original mean of 0.25 and standard deviation of 0.05!

Finally, let's generate some test data and apply the model:

``````| gentimes start=05/01/2021:00:00:00 end=05/01/2021:09:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| append
[| gentimes start=05/01/2021:09:00:00 end=05/01/2021:11:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.01, 0.005)`]
| append
[| gentimes start=05/01/2021:11:00:00 end=05/01/2021:12:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`]
| apply simple_gaussian_model
| rex field=BoundaryRanges "-Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
| table _time x lcl ucl``````

Very nice!

You can find the outliers directly--as you would in an alert search, for example--with a simple where command:

``````| apply simple_gaussian_model
| where 'IsOutlier(x)'==1.0
| table _time x``````

 _time x 2021-05-01 00:41:00 0.379067377 2021-05-01 02:54:00 0.411517318 2021-05-01 03:01:00 0.100776418 2021-05-01 07:18:00 0.131441104 2021-05-01 08:43:00 0.119352555 2021-05-01 08:49:00 0.379070878 2021-05-01 09:00:00 0.017377844 2021-05-01 09:01:00 0.013617436 2021-05-01 09:02:00 0.009148409 . . . . . .
Path Finder

Yes, I am fitting the data without outliers.

Initially I was using time slice buckets then fit with the by clause. This produced a range for each minute (I think), so the range kept changing.

``````| eval date_minutebin=strftime(_time, "%M")
| eval date_hour=strftime(_time, "%H")
| eval date_wday=strftime(_time, "%A")
| fit DensityFunction 1/1/g1 by "date_minutebin,date_hour,date_wday" into df_model threshold=0.05 dist=norm``````

For this specific case it was probably not needed, since I needed to look for outliers when the overall bandwidth was reduced (ie. high/low range for the whole data set).  Hopefully this makes sense.

As per your example if I kept it simple without the by clause, I get the desired result.

``| fit DensityFunction 1/1/g1 into df_model dist=norm``

For completeness - in my other data sets running the same fit parameters, is it possible to set lowerBound/lcl to zero since bandwidth cannot be a negative number?

Thank you for explaining how to create the test data. I found that really neat!

Builder

The dist=norm parameter tells the DensityFunction algorithm to use the normal distribution, which has bounds at -Infinity and +Infinity.

I used the default value (dist=auto), and the algorithm selected Gaussian kernel density estimation (dist=gaussian_kde).  I also constrained my mean and standard deviation in a way that would decrease the probability of test samples with values less than 0.

In practice, a normal distribution probably isn't the best fit for your data. We're not doing machine learning here so much as we are basic statistical analysis.

To see the shape of your data, the MLTK includes a histogram macro that works with the histogram visualization, but I prefer the chart command and the bar chart visualization. Just note that chart, bin, etc. produces duplicate bins when working with non-integral spans. I work around that bug with sort and dedup:

``````| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m
| eval _time=starttime
| fields + _time
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| chart count over x span=0.025
| sort - count
| dedup x
| sort x``````

My sample data is normally distributed as expected.

Path Finder

This is what that data set look like.

``````Base query
| fit DensityFunction 1/1/g1 show_density=true
| bin 1/1/g1 bins=100
| stats count avg("ProbabilityDensity(1/1/g1)") as pd by 1/1/g1
| makecontinuous 1/1/g1
| sort 1/1/g1``````

For fit I have reset the value back to auto. dist=auto

For apply I have dropped the -  for the lcl value boundary range.

``````| apply 1/1/g1
| rex field=BoundaryRanges "Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
| table _time 1/1/g1 lcl ucl``````

The chart looks good now.

Thank you for the assistance.  It has been really helpful.

Register for .conf21 Now! Go Vegas or Go Virtual!

### How will you .conf21? You decide! Go in-person in Las Vegas, 10/18-10/21, or go online with .conf21 Virtual, 10/19-10/20. Learn More or Register Now >

Get Updates on the Splunk Community!