I'd like to run some Z-score searches against my email logs, specifically to find outliers that send more than 1.5 standard deviations (STDEV) above their own average. Running a Z-score search across all senders turns up too many legitimate senders that simply have a higher output than others (Gmail, Verizon mail, etc).
To help me weed out high-volume senders, I was thinking that I'd need to calculate the average count for each sender per day, week, or month, and then compute a Z-score against that to find outliers? (I would also appreciate any suggestions regarding what to compare the average with, perhaps an integration with timewrap?)
Would I need a summary index for this, or could I do this in one search?
Thx
@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.
Aside, yes, there are lots of ways you could do this. I think a per-sender moving average makes sense. Here is an example that looks for z-scores beyond 1.5 (in either direction) for a group of hostnames in some proxy logs:
sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound
You could change the window or span or group-by fields to get some other analysis. I'd suggest you check out some of the searches in the ML Toolkit as they have nice examples of using interquartile range or median absolute deviation for doing similar things.
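For the email case specifically, here is a rough sketch of the same idea using a per-sender daily baseline in a single search, with no summary index. The index name (mail) and the field name (sender) are assumptions, so swap in whatever your mail logs actually use. The days_seen >= 7 cutoff is also just an arbitrary choice to skip senders with too little history to have a meaningful standard deviation.

```spl
index=mail earliest=-30d@d latest=@d
| bin _time span=1d
| stats count as msgs by sender, _time
| eventstats avg(msgs) as mu, stdev(msgs) as sigma, count as days_seen by sender
| where days_seen >= 7 AND sigma > 0
| eval z_score = round((msgs - mu) / sigma, 2)
| where z_score > 1.5
| sort - z_score
```

Because eventstats computes mu and sigma separately for each sender, a high-volume sender like Gmail is only compared against its own history and not against everyone else. Note that days on which a sender sent nothing produce no row, so the average is over active days only. Whether this stays practical as one search probably depends on your mail volume and how far back you look; if it gets slow, that is when a summary index would be worth considering.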
And for clarification, I'm trying to do the average count on a per-sender basis...
Are you using anomalydetection for the Z-score? Can you post your syntax?