The Splunk documentation for the outlier command says:
"The transform option truncates the outlying values to the threshold for outliers."
I would like to understand how it calculates the threshold mentioned above.
For the SPL below, the total_bytes value of 92000 is replaced with 000244. How does Splunk come up with the value of 244?
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier action=transform mark=true total_bytes
| rename total_bytes as transform_total_bytes
It looks like this is based on the interquartile range (note the param option - https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Outlier...)
You can validate this with the following example:
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| outlier action=transform mark=true total_bytes
Thanks a lot, ITWhisperer. You have increased my understanding a great deal.
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5)
| eval upper=p75+(iqr*1.5)
| outlier action=transform param=3 mark=true total_bytes
I am still not sure about the results from outlier though.
Given the above, why are the 2 rows with a value of "3" not flagged as outliers? I would have thought they would be replaced with "174".
Also, if I put in a param of 3, to override the default of 2.5, how does Splunk come up with the number of "250" to replace the "92000"?
The 3s are not flagged because you haven't set uselower=t; it defaults to uselower=f.
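As an untested sketch, the final outlier line of the search above could be changed as follows to look at the low side as well. The uselower option is documented for the outlier command; what value the 3s would then be truncated to is not something verified here.

```spl
| outlier action=transform param=3 uselower=true mark=true total_bytes
```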
As for why Splunk is picking the values it is, to be honest, I don't know. I just found a relationship that worked for your first example.
Personally, if I don't know how something works, I don't usually use it. For all we know, there might be a bug in the calculation. There is certainly something that we are missing.
So, my question to you is, why are you using action=transform?
What do you see the value in transforming the outliers rather than just removing them?
Given that we have our own method of generating a replacement value (albeit one that differs from Splunk's in all but one instance), why not use something that is known? That's what I would do until I understood what Splunk is doing. 😀
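A minimal sketch of that "known method", assuming the p25/p75 fences with a 1.5 multiplier from the validation search earlier in the thread. The field name clipped_total_bytes is only illustrative, and this is not claimed to reproduce what the outlier command does internally.

```spl
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted=split(data,",")
| eval day_hour_key=mvindex(splitted,0), date=mvindex(splitted,1), total_bytes=tonumber(mvindex(splitted,2))
``` compute the quartiles over all rows and derive the fences ```
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5), upper=p75+(iqr*1.5)
``` truncate values outside the fences; leave everything else untouched ```
| eval clipped_total_bytes=case(total_bytes>upper, upper, total_bytes<lower, lower, true(), total_bytes)
| fields day_hour_key date total_bytes clipped_total_bytes lower upper
```

Because the fences are ordinary fields, the multiplier can be changed or the lower clause dropped, and the replacement value is always one you can trace back to p25 and p75.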
Thanks ITWhisperer.
I have been finding outliers using the p25 and p75 functions to date.
I had created some fairly complex SPL to find the outliers, remove them, and create a baseline for comparison with current data.
This is so we could find spikes in current data compared to a baseline created from previous 365 days.
It works, mostly. But much room for improvement. To this end I have begun looking at Splunk MLTK to see if I could get better results from it.
I will be diving into "anomalydetection" and "persist" for instance. As I am not a data scientist, I will no doubt be winging it to a large extent. I did want to understand as much as possible what these are doing under the hood. But knew I would have to "trust in the force" to some extent.
If I cannot easily decipher what the outlier command is returning then it is not a good sign for when I dive deeper into MLTK. 😥
Oh well. Crash or crash through as they say. 😀 Thanks again for your insights.
Hi @jeelong,
Did you try the outlier command without options?
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier
| rename total_bytes as transform_total_bytes
Ciao.
Giuseppe