Splunk Search

Finding Outliers on event counts

childroland
Explorer

I am trying to build a query to find outliers using avg and stdev on a perfmon counter but the counter is not a value you can calculate an average and I can't figure out how to create a count of the counter then calculate the avg and stdev. Here is the query I have so far, mostly based on the Splunk Docs Outlier information

index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|eval time=strftime(_time,"%m/%d/%y %H:%M:%S")
|stats avg(counter) as AvgByHost stdev(counter) as StDevbyHost by host, time
|eval LowerBound=(AvgByHost-StDevByHost*2)
|eval UpperBound=(AvgByHost+StDevByHost*2)
|eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier"
| sort - isOutlier
|timechart span=1h count(isOutlier) by host

0 Karma
1 Solution

niketn
Legend

@childroland if you are using Windows Performance Counter monitoring, you should be using the field Value instead of counter. Also how frequently are you collecting the performance data? Is it less than a second? You have converted the time to string time with seconds precision. Just wanted to know if you are actually collecting data every second or not. Even if you really needed to bucket time together every second, you should use | bin _time span=1s and retain time as epoch. Timechart will not work on string time.

Have you tried the following search:

index=perfmon collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| bin _time span=1s
| stats avg(Value) as AvgByHost stdev(Value) as StDevbyHost by host, _time
| eval LowerBound=(AvgByHost-StDevByHost*2)
| eval UpperBound=(AvgByHost+StDevByHost*2)
| eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
| timechart span=1h count(isOutlier) by host

PS: | sort - isOutlier seems irrelevant if you are creating timechart which needs chart sorted by _time.

However, if you are interested in calculating standard deviation based outliers I would suggest you to try out Detect Numerical Outliers Showcase Experiment from the Machine Learning Toolkit app (requires Python For Scientific Computing add on as a pre-requisite depending on the type of OS). The example lists three algorithms:

  1. Standard deviation
  2. Median absolute deviation
  3. Interquartile range

Following is an example of Standard Deviation algorithm which you can try. For sampling I have used the following limits, which you need to adjust as per your use case.

index="perfmon" collection=HTTP counter=CurrentConnections host=mp-bts-sa* 
| timechart span=60s avg(Value) as connections 
| streamstats window=3600 current=true avg("connections") as avg stdev("connections") as stdev 
| eval lowerBound=(avg-stdev*1.5), upperBound=(avg+stdev*1.5)
| eval isOutlier=if('connections' < lowerBound OR 'connections' > upperBound, 1, 0)
| fields _time, "connections", lowerBound, upperBound, isOutlier
  1. average data collected every 60 seconds is taken as input value i.e. span=60s.
  2. a sliding window of 3600 seconds (1 hour) is taken as sliding time interval i.e. window=3600.
  3. a multiplier of 1.5 is to get the standard deviation (SD) value somewhere between 1st SD and 2nd SD.

If you create chart overlay of isOutlier field you can plot the outliers along with actual value and upper/lower bounds. Refer to one of my recent answer with Chart Overlay and Series Compare (available in Splunk 7.x): https://answers.splunk.com/answers/752153/building-a-dashboard-to-help-tune-our-network-sens.html

Also one of the older answers using Median Absolute Deviation method for outlier detection and plotting through Overlay on an Area Chart (line chart is also possible): https://answers.splunk.com/answers/747177/how-to-add-a-reference-line-to-an-outlier-chart-cr.html

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

View solution in original post

niketn
Legend

@childroland if you are using Windows Performance Counter monitoring, you should be using the field Value instead of counter. Also how frequently are you collecting the performance data? Is it less than a second? You have converted the time to string time with seconds precision. Just wanted to know if you are actually collecting data every second or not. Even if you really needed to bucket time together every second, you should use | bin _time span=1s and retain time as epoch. Timechart will not work on string time.

Have you tried the following search:

index=perfmon collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| bin _time span=1s
| stats avg(Value) as AvgByHost stdev(Value) as StDevbyHost by host, _time
| eval LowerBound=(AvgByHost-StDevByHost*2)
| eval UpperBound=(AvgByHost+StDevByHost*2)
| eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
| timechart span=1h count(isOutlier) by host

PS: | sort - isOutlier seems irrelevant if you are creating timechart which needs chart sorted by _time.

However, if you are interested in calculating standard deviation based outliers I would suggest you to try out Detect Numerical Outliers Showcase Experiment from the Machine Learning Toolkit app (requires Python For Scientific Computing add on as a pre-requisite depending on the type of OS). The example lists three algorithms:

  1. Standard deviation
  2. Median absolute deviation
  3. Interquartile range

Following is an example of Standard Deviation algorithm which you can try. For sampling I have used the following limits, which you need to adjust as per your use case.

index="perfmon" collection=HTTP counter=CurrentConnections host=mp-bts-sa* 
| timechart span=60s avg(Value) as connections 
| streamstats window=3600 current=true avg("connections") as avg stdev("connections") as stdev 
| eval lowerBound=(avg-stdev*1.5), upperBound=(avg+stdev*1.5)
| eval isOutlier=if('connections' < lowerBound OR 'connections' > upperBound, 1, 0)
| fields _time, "connections", lowerBound, upperBound, isOutlier
  1. average data collected every 60 seconds is taken as input value i.e. span=60s.
  2. a sliding window of 3600 seconds (1 hour) is taken as sliding time interval i.e. window=3600.
  3. a multiplier of 1.5 is to get the standard deviation (SD) value somewhere between 1st SD and 2nd SD.

If you create chart overlay of isOutlier field you can plot the outliers along with actual value and upper/lower bounds. Refer to one of my recent answer with Chart Overlay and Series Compare (available in Splunk 7.x): https://answers.splunk.com/answers/752153/building-a-dashboard-to-help-tune-our-network-sens.html

Also one of the older answers using Median Absolute Deviation method for outlier detection and plotting through Overlay on an Area Chart (line chart is also possible): https://answers.splunk.com/answers/747177/how-to-add-a-reference-line-to-an-outlier-chart-cr.html

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

childroland
Explorer

I tried the first query and it did not produce the expected results. I tried timechart but it does not seem to play well with the Outlier query in Splunk Docs and I am just starting in Splunk so I could not get it to work.

I don't think the one second and one minute interval will work; the servers are load balanced but at that rate it would produce hits one even small changes; although you are right 1 hour is a bit long. I was mainly trying to get it to work then I can fine tune the intervals. I did try the queries to see what would happen and I had hits even during periods when all servers on the farm were responding.

I did try the deviations but I could not get them to work and stumbled on the Outlier example in the Splunk Docs.

I will look at your chart examples; I am just getting started so I am not sure if I am up to that challenge yet, Thanks for the tips and examples they helped and provided a better query than what I can up with.

P.S. the sort was more to put hits first when I am reviewing results - you make a good point that it is not needed.

0 Karma

childroland
Explorer

I was able to get it worked out; here is the file query thank you jpolvino your input helped me work through the issue.

index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|bucket _time span=1h
|stats count(counter) as CountByHour by _time, host
|eventstats avg(CountByHour) as AvgByHost stdev(CountByHour) as StDevbyHost by host
|eval LowerBound=(AvgByHost-StDevbyHost*2)
|eval UpperBound=(AvgByHost+StDevbyHost*2)
|eval isOutlier=if(CountByHour < LowerBound OR CountByHour > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier", "CountByHour", AvgByHost, StDevbyHost
| sort - isOutlier

richgalloway
SplunkTrust
SplunkTrust

Please provide some sample field values. Why can an average not be calculated?

---
If this reply helps you, Karma would be appreciated.
0 Karma

childroland
Explorer

The query does list the field value counter=CurrentConnections (the CurrentConnections is the value of the field) which is why I am not able to get an average.

0 Karma

jpolvino
Builder

Looks like you are trying to access "counter" after the stats command. Unfortunately, stats is a transforming command, so you lose the original fields. If counter refers to the event count, then you can add it to your stats line:

|stats count(counter) AS counter avg(counter) as AvgByHost stdev(counter) as StDevbyHost by host, time
0 Karma

childroland
Explorer

Thank you - that gets me close it now counts the counter value the avg and stdev are not working

0 Karma

jpolvino
Builder

Also check your third line: StDevbyHost should match exactly how it is used below (StDevByHost)

0 Karma

childroland
Explorer

thank you. I was able to get the count to work by looking at your example and a few others

index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|bucket _time span=1h
|stats count(counter) as CountByHour by host, _time
|eventstats avg(CountByHour) as AvgByHost stdev(CountByHour) as StDevbyHost by host, _time
|eval LowerBound=(AvgByHost-StDevbyHost*2)
|eval UpperBound=(AvgByHost+StDevbyHost*2)
|eval isOutlier=if(AvgByHost < LowerBound OR AvgByHost > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier", "CountByHour"
| sort - isOutlier

my isOutlier is not working but I think that is because it is comparing against each server node so I have to work on that

0 Karma

jpolvino
Builder

Please clarify: Are you trying to get a count of events per host per hour? And then looking for those hosts that are outside of some sort of criteria?

0 Karma

childroland
Explorer

Sorry I think I should have worded the question better. I am trying to get a count of events by hour and then get the average and standard deviation of that count. Since the field "counter" is not a numeric value just a straight average and standard deviation do not work. I am not sure how to aggregate the event into a number.

0 Karma
Get Updates on the Splunk Community!

Now Available: Cisco Talos Threat Intelligence Integrations for Splunk Security Cloud ...

At .conf24, we shared that we were in the process of integrating Cisco Talos threat intelligence into Splunk ...

Preparing your Splunk Environment for OpenSSL3

The Splunk platform will transition to OpenSSL version 3 in a future release. Actions are required to prepare ...

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...