Solved: How to use outlier command?

jeelong · ‎04-21-2022

In Splunk documentation for the outlier command, it say:

" The transform option truncates the outlying values to the threshold for outliers."

Would like to understand how it calculates the threshold mentioned above.

For this SPL below, the total_bytes value of 92000, is replaced with 000244. How does Splunk come up with the value of 244?

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data | eval splitted = split(data,",") | eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date| outlier action=transform mark=true total_bytes | rename total_bytes as transform_total_bytes

ITWhisperer · ‎04-22-2022

It looks like this is based on the interquartile range (note param option - https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Outlier...)

You can validate this with this example

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| outlier action=transform mark=true total_bytes

View solution in original post

ITWhisperer · ‎04-22-2022

It looks like this is based on the interquartile range (note param option - https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Outlier...)

You can validate this with this example

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| outlier action=transform mark=true total_bytes

jeelong · ‎04-22-2022

Thanks alot ITWhisperer. You have increased my understanding a great deal.

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5)
| eval upper=p75+(iqr*1.5)
| outlier action=transform param=3 mark=true total_bytes

I am still not sure on the results from outlier though.

Given the above, why are the 2 rows with a value of "3" not flagged as an outlier? I would have thought they would be replaced with "174".

Also, if I put in a param of 3, to override the default of 2.5, how does Splunk come up with the number of "250" to replace the "92000"?

ITWhisperer · ‎04-22-2022

3 is not flagged because you haven't used uselower=t - it defaults to uselower=f

As for why splunk is picking the values it is, to be honest, I don't know - I just found a relationship that worked for your first example.

Personally, if I don't know how something works, I don't usually use it. For all we know, there might be a bug in the calculation - there certainly something that we are missing.

So, my question to you is, why are you using action=transform?

What do you see the value in transforming the outliers rather than just removing them?

Given that we have our own method of generating a replacement value (albeit a different one to that used by splunk except in one instance), why not use something that is known (that's what I would do until I understood what splunk is doing)? 😀

jeelong · ‎04-25-2022

Thanks ITWhisperer.

I have been finding outliers using the that p25 and p75 function to date.

Had created some fairly complex SPL to get the outlier, remove them, and create a baseline for current comparison.

This is so we could find spikes in current data compared to a baseline created from previous 365 days.

It works, mostly. But much room for improvement. To this end I have begun looking at Splunk MLTK to see if I could get better results from it.

I will be diving into "anomalydetection" and "persist" for instance. As I am not a data scientist, I will no doubt be winging it to a large extent. I did want to understand as much as possible what these are doing under the hood. But knew I would have to "trust in the force" to some extent.

If I cannot easily decipher what the outlier command is returning then it is not a good sign for when I dive deeper into MLTK. 😥

Oh well. Crash or crash through as they say. 😀 Thanks again for your insights.

gcusello · ‎04-21-2022

Hi @jeelong,

did you tried the outlier command without options?

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier  
| rename total_bytes as transform_total_bytes

Ciao.

Giuseppe

How to use outlier command?

stats

Continuing Innovation & New Integrations Unlock Full Stack Observability For Your ...

Monitoring Amazon Elastic Kubernetes Service (EKS)

Cloud Platform & Enterprise: Classic Dashboard Export Feature Deprecation