Hello fellow Splunkers,
What I am trying to understand is the purpose of the two consecutive stats commands. I read that the first one obtains a general count per user per day, and the second one detects how many times a given user appeared on that day, but honestly I had no success with that interpretation, so I am interested in understanding whether this structure is the basis for building a good baseline.
Thanks so much,
Hi @jogonz20
From a Splunk perspective, the first stats sets the stage for the second stats, which derives additional statistics from its output.
I'm not an expert in the matter, so bear with me.
For this type of problem you often need more than a plain count of logins per user, so you have to enrich and/or clean your dataset in a way that helps you detect the outliers (suspicious account behavior or brute-force logins in this case). This is known as pre-processing the dataset before wrangling/modeling in ML.
The first stats doesn't just count how many times a user logged in per day; it also tells you how many days the user was active, which is used later in the second stats. That's easy to miss because of the rather complex mix of functions in there.
Here is what's happening inside the second stats; imagine you run this search over the last 30 days:
- num_data_samples: counts, for each user, how many days in the window they logged in.
- recent_count: splits user activity into recent events (within the last 24 hours) versus null() for anything older, so 'count' ends up with two candidate values and max() keeps the highest of the recent ones as 'recent_count'. If no recent events are found, the result is null.
- avg: uses the relative_time() function to separate the older events from the recent ones and calculates the average daily login 'count' over the remaining days of the window. If no previous records are found, the result is null.
- stdev: uses the same relative_time() split and calculates the standard deviation of 'count' for each user over the remaining days of the window. If no previous records are found, the result is null.
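The steps above can be sketched in plain Python. The field names mirror the search; the cutoff standing in for the relative_time() 24-hour split, and the example numbers, are assumptions.

```python
from statistics import mean, stdev

def second_stats(daily, cutoff_day):
    """Per-user summary mirroring the second stats clause:
    num_data_samples, recent_count, avg, stdev.
    `daily` maps (user, day) -> login count; days >= cutoff_day are
    'recent' (standing in for the relative_time() 24-hour split)."""
    out = {}
    for user in {u for u, _ in daily}:
        recent = [c for (u, d), c in daily.items() if u == user and d >= cutoff_day]
        older  = [c for (u, d), c in daily.items() if u == user and d < cutoff_day]
        out[user] = {
            "num_data_samples": len(recent) + len(older),     # days with any logins
            "recent_count": max(recent) if recent else None,  # None plays the role of null()
            "avg": mean(older) if older else None,
            "stdev": stdev(older) if len(older) > 1 else None,
        }
    return out

# Hypothetical daily counts: alice logs in 3 and 5 times on earlier days,
# then 20 times on the most recent day.
daily = {("alice", 1): 3, ("alice", 2): 5, ("alice", 3): 20}
summary = second_stats(daily, cutoff_day=3)
```

A recent_count of 20 against a baseline of avg 4 with stdev ≈ 1.41 is exactly the kind of spike the outlier check that follows would flag.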
By doing this you synthesise extra features from the ones you already have and end up with more dimensions to help you find the outliers.
As a side note, I believe the nulls are missing their parentheses; in eval expressions the function is null(), not null. That said, I wouldn't keep null() in there at all: either fill the null values with 0 (e.g. with fillnull) or filter them out. Whichever you choose, pay attention to how it changes your results.
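To see why that choice matters, compare dropping the empty days against filling them with 0. A small illustration with made-up numbers, assuming a 30-day window in which the user was only seen on 3 days:

```python
from statistics import mean

observed = [4, 6, 5]  # daily counts for the 3 days the user was seen

dropped = mean(observed)             # nulls filtered out: baseline over active days
filled  = mean(observed + [0] * 27)  # nulls filled with 0 for the other 27 days
```

A baseline of 5 logins/day versus 0.5 logins/day changes which recent counts look anomalous, which is why the results deserve a second look either way.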
I hope it was helpful.
