Hi @jogonz20
From a Splunk perspective, the first stats sets the stage for the second stats to create additional statistics from it.
I'm not an expert in the matter, so bear with me.
For this type of thing you often need more than a simple count of logins per user, so you have to find a way to enrich and/or clean your dataset so that it helps you detect the outliers (suspicious account behaviour or brute-force logins in this case). This is known as pre-processing your dataset before wrangling/modeling in ML.
The first stats doesn't just count how many times a user logged in each day; it also gives you how many days a user logged in at all, which the second stats uses later, though that's easy to miss due to the rather complex mix of functions in there.
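I don't have your exact search in front of me, so take this as a rough sketch rather than your SPL; the index and field names (index=auth, action, user) are placeholder assumptions. A first stats in this kind of search typically looks something like:

index=auth action=login
| bin _time span=1d
| stats count BY user, _time

After this, each row is one user on one day with the number of logins for that day, so the number of rows per user is also the number of days that user logged in.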
That is what's happening inside the second stats: just imagine you run this search over the last 30 days.
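Continuing the same sketch (again, the field names are my assumptions, not necessarily what your search uses), the second stats then summarises those daily rows per user:

| stats count AS num_days
        avg(count) AS avg_logins_per_day
        stdev(count) AS stdev_logins_per_day
        max(count) AS max_logins_per_day
    BY user

Over a 30-day window, num_days tells you on how many of those days the user logged in at all, while avg and stdev describe their normal daily volume, which is exactly the baseline you compare against to flag a day as an outlier.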
By doing this you synthesise extra features from the ones you have and end up with more dimensions that will help you find the outliers.
As a side note, I believe the nulls are missing their (). That said, I wouldn't use null() in there; fill the null values with "0" or apply some filtering to remove them instead. But pay attention to the results whichever option you choose.
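For example, depending on where the nulls show up in your search, the fill option could be as simple as

| fillnull value=0 count

(replace count with whichever field carries the nulls), and the filtering option could be

| where isnotnull(count)

Keep in mind that filling with 0 includes the empty days in the average and stdev, while filtering leaves them out, so the baselines and any thresholds built on them will come out differently.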
I hope it was helpful.