Splunk Search

To run Z-score search against email logs, do I need to use summary index or can I get the avg count and then perform a Z-score analysis?

jwalzerpitt
Motivator

I'd like to run some Z-score searches against my email logs, specifically to see outliers that send traffic above their average by a standard deviation (STDEV) of >1.5. Running a Z-score search turns up too many legitimate senders that have a higher output than others (Gmail, Verizon mail, etc).

To help me weed out high volume senders, I was thinking that perhaps I'd need to calculate the average count of each sender per day, week, or month and then do a Z-score to find outliers? (I would also appreciate any suggestions in regards to what to compare the average with - perhaps an integration with timewrap?)

Would I need a summary index for this, or could I do this in one search?

Thx

0 Karma
1 Solution

aljohnson_splun
Splunk Employee
Splunk Employee

@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.

Aside, yes, there are lots of ways you could do this. I think a per-sendermoving average makes sense. Here is an example for looking for z_scores above 1.5 for a group of hostnames in some proxy logs:

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time 
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound

You could change the window or span or group-by fields to get some other analysis. I'd suggest you check out some of the searches in the ML Toolkit as they have nice examples of using interquartile range or median absolute deviation for doing similar things.

View solution in original post

aljohnson_splun
Splunk Employee
Splunk Employee

@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.

Aside, yes, there are lots of ways you could do this. I think a per-sendermoving average makes sense. Here is an example for looking for z_scores above 1.5 for a group of hostnames in some proxy logs:

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time 
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound

You could change the window or span or group-by fields to get some other analysis. I'd suggest you check out some of the searches in the ML Toolkit as they have nice examples of using interquartile range or median absolute deviation for doing similar things.

View solution in original post

jwalzerpitt
Motivator

and for clarification, I'm trying to do the avg count on a per sender basis...

0 Karma

cmerriman
Super Champion

are you using anomalydetection for the Z-Score? can you post your syntax at all?

0 Karma
Take the 2021 Splunk Career Survey

Help us learn about how Splunk has
impacted your career by taking the 2021 Splunk Career Survey.

Earn $50 in Amazon cash!