Splunk Search

How to use the outlier command?

jeelong

The Splunk documentation for the outlier command says:

"The transform option truncates the outlying values to the threshold for outliers."

I would like to understand how it calculates the threshold mentioned above.

In the SPL below, the total_bytes value of 92000 is replaced with 000244. How does Splunk come up with the value of 244?

 

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1), total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier action=transform mark=true total_bytes
| rename total_bytes as transform_total_bytes

 

 

Accepted solution

ITWhisperer

It looks like this is based on the interquartile range (note the param option: https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Outlier...)

You can validate this with the following example:

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval upper=p75+(iqr*1.5)
| outlier action=transform mark=true total_bytes
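
For reference, the arithmetic behind this validation search can be reproduced outside Splunk. The Python sketch below only checks the numbers. It assumes the quartiles are linearly interpolated (Python's statistics.quantiles with method="inclusive"); whether perc25/perc75 and the outlier command compute them exactly this way is an assumption, not something confirmed in this thread. Under that assumption the upper bound comes out at 243.75, which rounds to the 244 seen in the question.

```python
from statistics import quantiles

# total_bytes values from the sample data in the question
total_bytes = [3, 200, 210, 220, 200, 210, 220, 92000, 200, 3]

# Quartiles with linear interpolation (an assumption; Splunk may
# compute percentiles differently)
p25, _median, p75 = quantiles(total_bytes, n=4, method="inclusive")

iqr = p75 - p25
upper = p75 + 1.5 * iqr  # same formula as the `upper` eval in the search
lower = p25 - 1.5 * iqr  # mirror-image lower bound

print("p25:", p25, "p75:", p75, "iqr:", iqr)
print("upper:", upper, "rounded:", round(upper))
print("lower:", lower, "rounded:", round(lower))
```

The 1.5 multiplier is simply the one that reproduces the observed value for this data set; the sketch makes no claim about how the param option feeds into Splunk's own calculation.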

jeelong

Thanks a lot, ITWhisperer. You have increased my understanding a great deal.

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data
| eval splitted = split(data,",")
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| eventstats perc25(total_bytes) as p25 perc75(total_bytes) as p75
| eval iqr=p75-p25
| eval lower=p25-(iqr*1.5)
| eval upper=p75+(iqr*1.5)
| outlier action=transform param=3 mark=true total_bytes

I am still not sure about the results from outlier, though.

Given the above, why are the 2 rows with a value of "3" not flagged as outliers? I would have thought they would be replaced with "174".

Also, if I set param to 3 to override the default of 2.5, how does Splunk come up with the number "250" to replace the "92000"?

 

 

 


ITWhisperer

The rows with a value of 3 are not flagged because you haven't used uselower=t; it defaults to uselower=f, so only values above the upper threshold are treated as outliers.

As for why Splunk is picking the values it is, to be honest, I don't know. I just found a relationship that worked for your first example.

Personally, if I don't know how something works, I don't usually use it. For all we know, there might be a bug in the calculation; there is certainly something that we are missing.

So, my question to you is, why are you using action=transform?

What value do you see in transforming the outliers rather than just removing them?

Given that we have our own method of generating a replacement value (albeit a different one to that used by Splunk except in one instance), why not use something that is known (that's what I would do until I understood what Splunk is doing)? 😀
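
The "known" method mentioned above could look something like the following. This is a minimal Python sketch of the logic only, not Splunk's implementation: clamp_outliers, multiplier and use_lower are hypothetical names chosen for illustration (use_lower loosely mirrors the uselower option), and linearly interpolated quartiles are assumed.

```python
from statistics import quantiles


def clamp_outliers(values, multiplier=1.5, use_lower=False):
    """Clamp values beyond the IQR fences to the fence itself.

    Hypothetical helper for illustration; not how Splunk's outlier
    command is known to work.
    """
    # Linear-interpolation quartiles (an assumption)
    p25, _median, p75 = quantiles(values, n=4, method="inclusive")
    iqr = p75 - p25
    upper = p75 + multiplier * iqr
    lower = p25 - multiplier * iqr

    clamped = []
    for v in values:
        if v > upper:
            clamped.append(upper)      # truncate high outliers
        elif use_lower and v < lower:
            clamped.append(lower)      # truncate low outliers only on request
        else:
            clamped.append(v)
    return clamped


total_bytes = [3, 200, 210, 220, 200, 210, 220, 92000, 200, 3]
upper_only = clamp_outliers(total_bytes)
both_sides = clamp_outliers(total_bytes, use_lower=True)
print(upper_only)
print(both_sides)
```

In SPL, the same clamp could be written as an eval using if() or case() after the eventstats step that produces p25 and p75, which keeps the replacement value fully under your control.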


jeelong

Thanks ITWhisperer.

I have been finding outliers using that p25 and p75 approach to date.

I had created some fairly complex SPL to get the outliers, remove them, and create a baseline for comparison with current data.

This is so we could find spikes in current data compared to a baseline created from the previous 365 days.

It works, mostly, but there is much room for improvement. To this end I have begun looking at Splunk MLTK to see if I could get better results from it.

I will be diving into "anomalydetection" and "persist", for instance. As I am not a data scientist, I will no doubt be winging it to a large extent. I did want to understand as much as possible about what these are doing under the hood, but I knew I would have to "trust in the force" to some extent.

If I cannot easily decipher what the outlier command is returning then it is not a good sign for when I dive deeper into MLTK. 😥

Oh well. Crash or crash through as they say. 😀 Thanks again for your insights. 

 

 

 


gcusello

Hi @jeelong,

Did you try the outlier command without options?

| makeresults
| fields - _time
| eval data="101,20220101,3;101,20220102,200;101,20220103,210;101,20220104,220;101,20220105,200;101,20220106,210;101,20220107,220;101,20220108,92000;101,20220109,200;101,20220110,3;"
| makemv delim=";" data
| mvexpand data 
| eval splitted = split(data,",") 
| eval day_hour_key=mvindex(splitted,0,0), date=mvindex(splitted,1,1) , total_bytes=mvindex(splitted,2,2)
| fields day_hour_key,total_bytes,date
| outlier  
| rename total_bytes as transform_total_bytes

Ciao.

Giuseppe
