I am trying to build a query to find outliers using avg and stdev on a perfmon counter but the counter is not a value you can calculate an average and I can't figure out how to create a count of the counter then calculate the avg and stdev. Here is the query I have so far, mostly based on the Splunk Docs Outlier information
index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|eval time=strftime(_time,"%m/%d/%y %H:%M:%S")
|stats avg(counter) as AvgByHost stdev(counter) as StDevbyHost by host, time
|eval LowerBound=(AvgByHost-StDevByHost*2)
|eval UpperBound=(AvgByHost+StDevByHost*2)
|eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier"
| sort - isOutlier
|timechart span=1h count(isOutlier) by host
@childroland if you are using Windows Performance Counter monitoring, you should be using the field Value
instead of counter
. Also how frequently are you collecting the performance data? Is it less than a second? You have converted the time to string time with seconds precision. Just wanted to know if you are actually collecting data every second or not. Even if you really needed to bucket time together every second, you should use | bin _time span=1s
and retain time as epoch. Timechart will not work on string time.
Have you tried the following search:
index=perfmon collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| bin _time span=1s
| stats avg(Value) as AvgByHost stdev(Value) as StDevbyHost by host, _time
| eval LowerBound=(AvgByHost-StDevByHost*2)
| eval UpperBound=(AvgByHost+StDevByHost*2)
| eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
| timechart span=1h count(isOutlier) by host
PS: | sort - isOutlier
seems irrelevant if you are creating timechart which needs chart sorted by _time.
However, if you are interested in calculating standard deviation based outliers I would suggest you to try out Detect Numerical Outliers Showcase Experiment
from the Machine Learning Toolkit app (requires Python For Scientific Computing add on as a pre-requisite depending on the type of OS). The example lists three algorithms:
Following is an example of Standard Deviation
algorithm which you can try. For sampling I have used the following limits, which you need to adjust as per your use case.
index="perfmon" collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| timechart span=60s avg(Value) as connections
| streamstats window=3600 current=true avg("connections") as avg stdev("connections") as stdev
| eval lowerBound=(avg-stdev*1.5), upperBound=(avg+stdev*1.5)
| eval isOutlier=if('connections' < lowerBound OR 'connections' > upperBound, 1, 0)
| fields _time, "connections", lowerBound, upperBound, isOutlier
span=60s
.window=3600
.1.5
is to get the standard deviation (SD) value somewhere between 1st SD and 2nd SD. If you create chart overlay of isOutlier field you can plot the outliers along with actual value and upper/lower bounds. Refer to one of my recent answer with Chart Overlay and Series Compare (available in Splunk 7.x): https://answers.splunk.com/answers/752153/building-a-dashboard-to-help-tune-our-network-sens.html
Also one of the older answers using Median Absolute Deviation method for outlier detection and plotting through Overlay on an Area Chart (line chart is also possible): https://answers.splunk.com/answers/747177/how-to-add-a-reference-line-to-an-outlier-chart-cr.html
@childroland if you are using Windows Performance Counter monitoring, you should be using the field Value
instead of counter
. Also how frequently are you collecting the performance data? Is it less than a second? You have converted the time to string time with seconds precision. Just wanted to know if you are actually collecting data every second or not. Even if you really needed to bucket time together every second, you should use | bin _time span=1s
and retain time as epoch. Timechart will not work on string time.
Have you tried the following search:
index=perfmon collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| bin _time span=1s
| stats avg(Value) as AvgByHost stdev(Value) as StDevbyHost by host, _time
| eval LowerBound=(AvgByHost-StDevByHost*2)
| eval UpperBound=(AvgByHost+StDevByHost*2)
| eval isOutlier=if(counter < LowerBound OR counter > UpperBound, 1,0)
| timechart span=1h count(isOutlier) by host
PS: | sort - isOutlier
seems irrelevant if you are creating timechart which needs chart sorted by _time.
However, if you are interested in calculating standard deviation based outliers I would suggest you to try out Detect Numerical Outliers Showcase Experiment
from the Machine Learning Toolkit app (requires Python For Scientific Computing add on as a pre-requisite depending on the type of OS). The example lists three algorithms:
Following is an example of Standard Deviation
algorithm which you can try. For sampling I have used the following limits, which you need to adjust as per your use case.
index="perfmon" collection=HTTP counter=CurrentConnections host=mp-bts-sa*
| timechart span=60s avg(Value) as connections
| streamstats window=3600 current=true avg("connections") as avg stdev("connections") as stdev
| eval lowerBound=(avg-stdev*1.5), upperBound=(avg+stdev*1.5)
| eval isOutlier=if('connections' < lowerBound OR 'connections' > upperBound, 1, 0)
| fields _time, "connections", lowerBound, upperBound, isOutlier
span=60s
.window=3600
.1.5
is to get the standard deviation (SD) value somewhere between 1st SD and 2nd SD. If you create chart overlay of isOutlier field you can plot the outliers along with actual value and upper/lower bounds. Refer to one of my recent answer with Chart Overlay and Series Compare (available in Splunk 7.x): https://answers.splunk.com/answers/752153/building-a-dashboard-to-help-tune-our-network-sens.html
Also one of the older answers using Median Absolute Deviation method for outlier detection and plotting through Overlay on an Area Chart (line chart is also possible): https://answers.splunk.com/answers/747177/how-to-add-a-reference-line-to-an-outlier-chart-cr.html
I tried the first query and it did not produce the expected results. I tried timechart but it does not seem to play well with the Outlier query in Splunk Docs and I am just starting in Splunk so I could not get it to work.
I don't think the one second and one minute interval will work; the servers are load balanced but at that rate it would produce hits one even small changes; although you are right 1 hour is a bit long. I was mainly trying to get it to work then I can fine tune the intervals. I did try the queries to see what would happen and I had hits even during periods when all servers on the farm were responding.
I did try the deviations but I could not get them to work and stumbled on the Outlier example in the Splunk Docs.
I will look at your chart examples; I am just getting started so I am not sure if I am up to that challenge yet, Thanks for the tips and examples they helped and provided a better query than what I can up with.
P.S. the sort was more to put hits first when I am reviewing results - you make a good point that it is not needed.
I was able to get it worked out; here is the file query thank you jpolvino your input helped me work through the issue.
index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|bucket _time span=1h
|stats count(counter) as CountByHour by _time, host
|eventstats avg(CountByHour) as AvgByHost stdev(CountByHour) as StDevbyHost by host
|eval LowerBound=(AvgByHost-StDevbyHost*2)
|eval UpperBound=(AvgByHost+StDevbyHost*2)
|eval isOutlier=if(CountByHour < LowerBound OR CountByHour > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier", "CountByHour", AvgByHost, StDevbyHost
| sort - isOutlier
Please provide some sample field values. Why can an average not be calculated?
The query does list the field value counter=CurrentConnections (the CurrentConnections is the value of the field) which is why I am not able to get an average.
Looks like you are trying to access "counter" after the stats
command. Unfortunately, stats
is a transforming command, so you lose the original fields. If counter refers to the event count, then you can add it to your stats line:
|stats count(counter) AS counter avg(counter) as AvgByHost stdev(counter) as StDevbyHost by host, time
Thank you - that gets me close it now counts the counter value the avg and stdev are not working
Also check your third line: StDevbyHost should match exactly how it is used below (StDevByHost)
thank you. I was able to get the count to work by looking at your example and a few others
index=perfmon collection=HTTP counter=CurrentConnections host=ServerFarm
|bucket _time span=1h
|stats count(counter) as CountByHour by host, _time
|eventstats avg(CountByHour) as AvgByHost stdev(CountByHour) as StDevbyHost by host, _time
|eval LowerBound=(AvgByHost-StDevbyHost*2)
|eval UpperBound=(AvgByHost+StDevbyHost*2)
|eval isOutlier=if(AvgByHost < LowerBound OR AvgByHost > UpperBound, 1,0)
|fields time, host, "LowerBound", "UpperBound", "isOutlier", "CountByHour"
| sort - isOutlier
my isOutlier is not working but I think that is because it is comparing against each server node so I have to work on that
Please clarify: Are you trying to get a count of events per host per hour? And then looking for those hosts that are outside of some sort of criteria?
Sorry I think I should have worded the question better. I am trying to get a count of events by hour and then get the average and standard deviation of that count. Since the field "counter" is not a numeric value just a straight average and standard deviation do not work. I am not sure how to aggregate the event into a number.