Scenario:
We have a data source of interest that we wish to analyze.
The data source is hourly host activity events.
An endpoint agent installed on a user's host monitors for specific events.
The endpoint agent reports these events to the central server, aka manager/collector.
Then the central server sends the data/events to Splunk for ingest.
We found that a distinct count of specific action events per hour per host is very interesting to us.
If the hourly count per user is greater than "the normal behavior" average then we want to be alerted.
We define normal behavior as the "90 day average of distinct hourly counts per host/user".
We define an outlier/alert as an hourly distinct count more than 2 standard deviations above the 90-day hourly average.
For instance, if the 90-day hourly average is 2 events for a host and the standard deviation is small, then 10 events in a single hour for that host would fire an alert.
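To make the threshold arithmetic concrete, here is a minimal sketch using made-up hourly counts. The values 1, 2, 2, 3, 2, 2 and the helper field n are illustrative only, not real data, and the hard-coded 10 stands in for the current hour's count.

```spl
| makeresults count=6
| streamstats count as n
| eval Foo_Count=case(n=1,1, n=2,2, n=3,2, n=4,3, n=5,2, n=6,2)
| stats avg(Foo_Count) as Averg stdev(Foo_Count) as sdev
| eval upperBound=Averg+2*sdev
| eval isOutlier=if(10>upperBound,1,0)
```

With these sample values the average is 2 and upperBound works out to roughly 3.26, so an hour with 10 events would be flagged. Note that stdev in SPL is the sample standard deviation.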
We tried many different methods and found some anomalies.
One issue is the events' arrival time to Splunk.
Specifically, the data does not always arrive to Splunk in a consistent interval.
The endpoint agent may be delayed in processing or sending the data to the central server if the network connection is lost or the host was suspended or shut down shortly after the events of interest occurred. We have accepted this issue, as it's very infrequent.
Methodology:
In order to conduct our analysis we have multiple phases.
Phase 1 > prepare the data and output to KVstore lookup
We run a query to prime the historic data.
index=foo earliest=-90d@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0
| untable _time host Foo_Count | outputlookup 90d-Foo_Count
Then we modify and save the query to append the new data. We use -2h@h and -1h@h to mitigate lagging events. This report runs first, every hour at minute=0.
index=foo earliest=-2h@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0
| untable _time host Foo_Count |outputlookup 90d-Foo_Count append=t
Phase 2 > calculate the upperBound for each user
This report runs second every hour at minute=15. We add additional statistics for investigation purposes.
|inputlookup 90d-Foo_Count |timechart span=1h values(Foo_Count) as Foo_Count by host limit=0 | untable _time host Foo_Count
| stats min(Foo_Count) as Mini max(Foo_Count) as Maxi mean(Foo_Count) as Averg stdev(Foo_Count) as sdev median(Foo_Count) as Med mode(Foo_Count) as Mod range(Foo_Count) as Rng by host
| eval upperBound=(Averg+sdev*exact(2)) | outputlookup Foo_Count-upperBound
Phase 3 > trim the oldest data to maintain a 90d@h interval
This report runs third every hour at minute=30.
|inputlookup 90d-Foo_Count | eval trim_time = relative_time(now(),"-90d@h") | where _time>trim_time | convert ctime(trim_time) |outputlookup 90d-Foo_Count
Phase 4 > detect outliers
This alert runs fourth (last) every hour at minute=45.
index=foo earliest=-1h@h latest=@h foo_event=* host=*
| stats dc(foo_event) as Foo_Count by host
| lookup Foo_Count-upperBound host output upperBound | eval isOutlier=if('Foo_Count' > upperBound, 1, 0)
This method is successful at alerting on outliers.
RE: event lag, we monitor it and keep track of how significant it is.
Originally, we tried using the MLTK with a DensityFunction and partial fit. However, we have approximately 65 million data points, which causes issues with the Smart Outlier Detection assistant.
The question is whether anyone has a different or more efficient way to do this?
Thank you for your time!
Nice description of your process.
By storing the hourly counts in the KV store, you've solved a key issue: the performance cost of calculating the rolling average over lots of data points.
Unless you can fix the lag issue, you have to deal with it as you have done, with the 1h delay.
I have had situations in the past where a single average over the 90 days was not useful, because different days and times of day behave differently, so I've used time bins for the lookup.
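A minimal sketch of what a time-binned baseline could look like, assuming the 90d-Foo_Count lookup from Phase 1. The field names hour_of_day, day_of_week and samples, and the lookup name Foo_Count-byTimeBin, are hypothetical; the new lookup would need its own definition before this runs.

```spl
| inputlookup 90d-Foo_Count
| eval hour_of_day=strftime(_time,"%H"), day_of_week=strftime(_time,"%a")
| stats avg(Foo_Count) as Averg stdev(Foo_Count) as sdev count as samples by host day_of_week hour_of_day
| outputlookup Foo_Count-byTimeBin
```

The detection search would then compute the same two fields for the current hour and look up on host, day_of_week and hour_of_day. One trade-off: over 90 days each host only gets about 12 or 13 samples per weekday-hour bin, so the stdev will be noisier than with a single 90-day baseline.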
Are you using accelerated_fields on host in your KV store? If the 65m data points are in the KV store, the additional index may improve lookup times, or at least reduce the impact on the host.
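For reference, a field acceleration is declared per collection in collections.conf. This is only a sketch: the collection name foo_count_90d and the acceleration name host_accel are hypothetical, since the thread does not say which collection backs the lookup.

```
# collections.conf
# [foo_count_90d] is a hypothetical collection name; use the collection behind your lookup
[foo_count_90d]
# index the host field in ascending order (1); host_accel is an arbitrary label
accelerated_fields.host_accel = {"host": 1}
```

Whether this helps depends on how the lookup is queried; it is most likely to matter for searches that filter or match on host rather than ones that read the whole collection.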
I would have also stored the average and stdev in the lookup for the host, so that the 2 * stdev calculation can be done in the outlier detection rather than baking the 2* factor into the lookup table, but that may not be useful in your use case.
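As a rough sketch of that idea, the detection search could pull the stored average and stdev and apply the multiplier at search time. This assumes the Foo_Count-upperBound lookup contains the Averg and sdev fields written in Phase 2; the multiplier field is hypothetical and could be driven by a dashboard token instead of a literal.

```spl
index=foo earliest=-1h@h latest=@h foo_event=* host=*
| stats dc(foo_event) as Foo_Count by host
| lookup Foo_Count-upperBound host OUTPUT Averg sdev
| eval multiplier=2
| eval upperBound=Averg+(sdev*multiplier)
| where Foo_Count>upperBound
```

Changing multiplier to, say, 1.8 or 2.2 then adjusts the sensitivity without rebuilding the lookup.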
Thank you for the reply and suggestions.
We will investigate using accelerated_fields. TY!
We currently store a lot of statistical info in the lookup "Foo_Count-upperBound", including "mean" and "stdev".
RE:
I would have also stored the average and stdev in the lookup for the host, so that the 2 * stdev calculation can be done in the outlier detection rather than baking the 2* factor into the lookup table
Are you suggesting this to improve performance, or for tuning the threshold of the outlier?
I think I know what you mean, it sort of cleans it up more. TY!
My thinking on storing the base stdev is about tuning the outlier threshold. I can choose my outlier range in the dashboard, which I may want to be 1.8 or 2.2 or... In practice, if you know that upperBound is stored as average + 2*stdev and the average is stored alongside it, then you can always recalculate anything from that anyway.
But that's all about use case. There's unlikely to be any significant performance impact from delaying the calculation, but I tend to work more with dashboards for triage than predefined reports, hence my bias.
Accelerated fields are configured with the accelerated_fields setting in collections.conf, and they can make a significant difference to lookup performance.
Thank you for the reply, that makes sense.