
Outlier Dip Trough Detection

pjohnson1
Path Finder

I am working with time series data and would like to detect these types of troughs in the graphs. The y-axis is network bandwidth, and the minimum value is 0.

I'm fitting the base time series query to a DensityFunction model and then using the following SPL for the outlier chart:

 

 

| eval leftRange=mvindex(BoundaryRanges,0)
| eval rightRange=mvindex(BoundaryRanges,1)
| rex field=leftRange "Infinity:(?<lowerBound>[^:]*):"
| rex field=rightRange "(?<upperBound>[^:]*):Infinity"
| fields _time, 1/1/g1, lowerBound, upperBound, "IsOutlier(1/1/g1)", *

 

 

Screen Shot 2021-04-29 at 12.52.31 pm.png

What approach can I take to detect the significant dip in the graph?


tscroggins
Influencer

@pjohnson1 

Are you fitting your model using stable data without outliers?

Here's an example you can recreate without data:

First, let's define two macros to generate a bit of Gaussian noise:

 

# macros.conf

[norminv(3)]
args = p,u,s
definition = "exact($u$ + $s$ * if($p$ < 0.5, -1 * (sqrt(-2.0 * ln($p$)) - ((0.010328 * sqrt(-2.0 * ln($p$)) + 0.802853) * sqrt(-2.0 * ln($p$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln($p$)) + 0.189269) * sqrt(-2.0 * ln($p$)) + 1.432788) * sqrt(-2.0 * ln($p$)) + 1.0)), (sqrt(-2.0 * ln(1 - $p$)) - ((0.010328 * sqrt(-2.0 * ln(1 - $p$)) + 0.802853) * sqrt(-2.0 * ln(1 - $p$)) + 2.515517) / (((0.001308 * sqrt(-2.0 * ln(1 - $p$)) + 0.189269) * sqrt(-2.0 * ln(1 - $p$)) + 1.432788) * sqrt(-2.0 * ln(1 - $p$)) + 1.0))))"
iseval = 1

[rand]
definition = "random()/2147483647"
iseval = 1

 

norminv(3) is similar to the norminv function in Excel, MATLAB, et al. and returns the inverse of the normal cumulative distribution function with a probability of p, a mean of u, and a standard deviation of s. p must be greater than 0 and less than 1. The estimator is taken from Abramowitz and Stegun; more precise estimators are available from e.g. Odeh and Evans, but this one is fine for toys like this.

rand() generates a random number between 0 and 1 by dividing the output of Splunk's random() function by its known maximum value of 2147483647 (2^31 - 1).
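
As a quick sanity check (a sketch, assuming the two stanzas above are saved and shared), norminv should return roughly 1.96 for p = 0.975 on a standard normal, and rand() should stay between 0 and 1:

| makeresults count=1000
| eval p=`rand()`
| eval z=`norminv(0.975, 0, 1)`
| stats min(p) max(p) avg(p) max(z)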

Next, let's generate some training data and fit it to a model:

 

| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m 
| eval _time=starttime 
| fields + _time 
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| fit DensityFunction x into simple_gaussian_model

 

We can review the model parameters with the summary command:

 

| summary simple_gaussian_model

 

type: Auto: Gaussian KDE
min: 0.106517
max: 0.386431
mean: 0.251574
std: 0.042535
cardinality: 1440
distance: metric: wasserstein, distance: 0.0010866394106055493
other: bandwidth: 0.009932847538368504, parameter size: 1440

Very close to our original mean of 0.25 and standard deviation of 0.05!
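
You can also sanity-check the raw sample itself by tacking a stats command onto the same training search (repeated here in full for convenience):

| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m 
| eval _time=starttime 
| fields + _time 
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
``` empirical mean and standard deviation of the generated sample ```
| stats avg(x) as mean stdev(x) as std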

Finally, let's generate some test data and apply the model:

 

| gentimes start=05/01/2021:00:00:00 end=05/01/2021:09:00:00 increment=1m 
| eval _time=starttime 
| fields + _time 
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)` 
| append 
    [| gentimes start=05/01/2021:09:00:00 end=05/01/2021:11:00:00 increment=1m 
    | eval _time=starttime 
    | fields + _time 
    | eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.01, 0.005)`] 
| append 
    [| gentimes start=05/01/2021:11:00:00 end=05/01/2021:12:00:00 increment=1m 
    | eval _time=starttime 
    | fields + _time 
    | eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`]
| apply simple_gaussian_model
| rex field=BoundaryRanges "-Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
| table _time x lcl ucl

 

 outliers.png

 

Very nice!

You can find the outliers directly--as you would in an alert search, for example--with a simple where command:

 

| apply simple_gaussian_model
| where 'IsOutlier(x)'==1.0
| table _time x

 

_time                  x
2021-05-01 00:41:00    0.379067377
2021-05-01 02:54:00    0.411517318
2021-05-01 03:01:00    0.100776418
2021-05-01 07:18:00    0.131441104
2021-05-01 08:43:00    0.119352555
2021-05-01 08:49:00    0.379070878
2021-05-01 09:00:00    0.017377844
2021-05-01 09:01:00    0.013617436
2021-05-01 09:02:00    0.009148409
..
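
Since the original question is about a sustained dip rather than a single outlying sample, one option is to count consecutive outlier rows with streamstats and only alert when the run is long enough. This is a sketch only; the five-minute threshold is an assumption you would tune:

| apply simple_gaussian_model
``` ensure events are in time order before counting runs ```
| sort 0 _time
| eval outlier='IsOutlier(x)'
``` count how long the current outlier/non-outlier run has lasted ```
| streamstats reset_on_change=true count as run_length by outlier
``` keep runs of at least 5 consecutive outlying minutes (hypothetical threshold) ```
| where outlier==1.0 AND run_length>=5
| table _time x run_length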

pjohnson1
Path Finder

Yes, I am fitting the data without outliers.

Initially I was using time-slice buckets and then fit with the by clause. This produced a range for each minute (I think), so the range kept changing.

Screen Shot 2021-05-03 at 9.55.38 pm.png

 

 

| eval date_minutebin=strftime(_time, "%M")
| eval date_hour=strftime(_time, "%H")
| eval date_wday=strftime(_time, "%A")
| fit DensityFunction 1/1/g1 by "date_minutebin,date_hour,date_wday" into df_model threshold=0.05 dist=norm

 

For this specific case it was probably not needed, since I needed to look for outliers when the overall bandwidth was reduced (i.e. a high/low range for the whole data set). Hopefully this makes sense.

 

As per your example if I kept it simple without the by clause, I get the desired result.

Screen Shot 2021-05-03 at 9.40.57 pm.png

 

 

| fit DensityFunction 1/1/g1 into df_model dist=norm

 

 

For completeness - in my other data sets running the same fit parameters, is it possible to set lowerBound/lcl to zero since bandwidth cannot be a negative number? 

Screen Shot 2021-05-03 at 10.04.11 pm.png

Thank you for explaining how to create the test data. I found that really neat!


tscroggins
Influencer

@pjohnson1 

The dist=norm parameter tells the DensityFunction algorithm to use the normal distribution, which has bounds at -Infinity and +Infinity.

I used the default value (dist=auto), and the algorithm selected Gaussian kernel density estimation (dist=gaussian_kde).  I also constrained my mean and standard deviation in a way that would decrease the probability of test samples with values less than 0.
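
To answer the lower bound question directly: the model itself won't clamp at zero, but you can floor lcl after apply with a simple eval. A sketch using your field and model names:

| apply df_model
| rex field=BoundaryRanges "-Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
``` bandwidth can't be negative, so floor the lower control limit at 0 ```
| eval lcl=max(tonumber(lcl), 0)
| table _time 1/1/g1 lcl ucl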

In practice, a normal distribution probably isn't the best fit for your data. We're not doing machine learning here so much as basic statistical analysis.

To see the shape of your data, the MLTK includes a histogram macro that works with the histogram visualization, but I prefer the chart command and the bar chart visualization. Just note that chart, bin, etc. produce duplicate bins when working with non-integral spans; I work around that bug with sort and dedup:

| gentimes start=04/30/2021:00:00:00 end=05/01/2021:00:00:00 increment=1m 
| eval _time=starttime 
| fields + _time 
| eval x=`norminv("`rand()`*(0.9999999999999999-0.0000000000000001)+0.0000000000000001", 0.25, 0.05)`
| chart count over x span=0.025
| sort - count
| dedup x
| sort x

tscroggins_0-1620086426954.png

My sample data is normally distributed as expected.

pjohnson1
Path Finder

This is what that data set looks like.

 

Base query
| fit DensityFunction 1/1/g1 show_density=true
| bin 1/1/g1 bins=100
| stats count avg("ProbabilityDensity(1/1/g1)") as pd by 1/1/g1
| makecontinuous 1/1/g1
| sort 1/1/g1

 

 

Screen Shot 2021-05-04 at 11.03.54 am.png

For fit, I have reset the distribution back to the default, dist=auto.

For apply, I have dropped the - from the lcl value boundary range extraction.

 

| apply df_model
| rex field=BoundaryRanges "Infinity:(?<lcl>[^:]+)"
| rex field=BoundaryRanges "(?<ucl>[^:]+):Infinity"
| table _time 1/1/g1 lcl ucl

 

The chart looks good now.

Screen Shot 2021-05-04 at 11.15.23 am.png

Thank you for the assistance.  It has been really helpful.

 
