Solved: Re: outlier command

jeelong · ‎04-21-2022

In the Splunk documentation for the outlier command, it says:

"The transform option truncates the outlying values to the threshold for outliers."

Would like to understand how it calculates the threshold mentioned above.

For the SPL below, the total_bytes value of 92000 is replaced with 000244. How does Splunk come up with the value of 244?

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier action=transform mark=true total_bytes
| rename total_bytes as transform_total_bytes

ITWhisperer · ‎04-22-2022

It looks like this is based on the interquartile range (note param option - https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Outlier...)

You can validate this with this example

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| outlier action=transform mark=true total_bytes
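
To see the computed threshold next to what outlier actually writes, one option is to keep a copy of the original value before outlier rewrites the field. This is a minimal sketch on the same sample data, not a statement of how outlier works internally. The field names original_bytes and upper_rounded are made up for this illustration, and the 1.5 multiplier is only the relationship guessed at above.

```spl
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval date=mvindex(splitted,1), total_bytes=tonumber(mvindex(splitted,2))
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| eval original_bytes=total_bytes, upper_rounded=round(upper)
| outlier action=transform mark=true total_bytes
| table date original_bytes total_bytes p25 p75 iqr upper upper_rounded
```

With mark=true, any row where total_bytes now starts with 000 was transformed, so that row can be compared directly against upper_rounded.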

jeelong · ‎04-22-2022

Thanks a lot, ITWhisperer. You have increased my understanding a great deal.

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5)
| eval upper=p75+(iqr*1.5)
| outlier action=transform param=3 mark=true total_bytes

I am still not sure on the results from outlier though.

Given the above, why are the 2 rows with a value of "3" not flagged as an outlier? I would have thought they would be replaced with "174".

Also, if I put in a param of 3, to override the default of 2.5, how does Splunk come up with the number of "250" to replace the "92000"?

ITWhisperer · ‎04-22-2022

3 is not flagged because you haven't used uselower=t - it defaults to uselower=f
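
A sketch of the same search with uselower=true, so that values below the lower threshold are also considered. The replacement values are deliberately not stated here, since the exact threshold calculation is still unclear in this thread.

```spl
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval date=mvindex(splitted,1), total_bytes=tonumber(mvindex(splitted,2))
| fields total_bytes,date
| outlier action=transform uselower=true mark=true total_bytes
```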

As for why splunk is picking the values it is, to be honest, I don't know - I just found a relationship that worked for your first example.

Personally, if I don't know how something works, I don't usually use it. For all we know, there might be a bug in the calculation - there is certainly something that we are missing.

So, my question to you is, why are you using action=transform?

What do you see the value in transforming the outliers rather than just removing them?

Given that we have our own method of generating a replacement value (albeit a different one to that used by splunk except in one instance), why not use something that is known (that's what I would do until I understood what splunk is doing)? 😀
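
For example, a known replacement could be done with plain eval instead of outlier. This is only a sketch of the usual 1.5 x IQR fences, not what the outlier command does internally. The field names is_outlier and clipped_bytes are made up for this illustration.

```spl
| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval date=mvindex(splitted,1), total_bytes=tonumber(mvindex(splitted,2))
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5)
| eval upper=p75+(iqr*1.5)
| eval is_outlier=if(total_bytes<lower OR total_bytes>upper, 1, 0)
| eval clipped_bytes=case(total_bytes>upper, upper, total_bytes<lower, lower, true(), total_bytes)
| table date total_bytes lower upper is_outlier clipped_bytes
```

To remove the outliers rather than clip them, the last eval could be swapped for | where is_outlier=0.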

jeelong · ‎04-25-2022

Thanks ITWhisperer.

I have been finding outliers using the p25 and p75 functions to date.

Had created some fairly complex SPL to get the outlier, remove them, and create a baseline for current comparison.

This is so we could find spikes in current data compared to a baseline created from previous 365 days.

It works, mostly. But much room for improvement. To this end I have begun looking at Splunk MLTK to see if I could get better results from it.

I will be diving into "anomalydetection" and "persist" for instance. As I am not a data scientist, I will no doubt be winging it to a large extent. I did want to understand as much as possible what these are doing under the hood. But knew I would have to "trust in the force" to some extent.

If I cannot easily decipher what the outlier command is returning then it is not a good sign for when I dive deeper into MLTK. 😥

Oh well. Crash or crash through as they say. 😀 Thanks again for your insights.

gcusello · ‎04-21-2022

Hi @jeelong,

Did you try the outlier command without options?

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier  
| rename total_bytes as transform_total_bytes

Ciao.

Giuseppe
