Splunk Search

To run a Z-score search against email logs, do I need to use a summary index, or can I get the average count and then perform a Z-score analysis?

jwalzerpitt
Influencer

I'd like to run some Z-score searches against my email logs, specifically to find outlier senders whose traffic exceeds their average by more than 1.5 standard deviations (STDEV). Running a plain Z-score search turns up too many legitimate senders that simply have higher output than others (Gmail, Verizon mail, etc.).

To help me weed out the high-volume senders, I was thinking I'd need to calculate the average count for each sender per day, week, or month and then run a Z-score against that to find outliers. (I'd also appreciate any suggestions on what to compare the average against; perhaps an integration with timewrap?)

Would I need a summary index for this, or could I do this in one search?
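
Roughly what I'm picturing as a single search, assuming the sender address is already extracted into a sender field (the sourcetype placeholder is mine):

sourcetype=<your email sourcetype> earliest=-30d
| bin _time span=1d
| stats count by sender, _time
| eventstats avg(count) as avg_count, stdev(count) as stdev_count by sender
| eval z_score = (count - avg_count) / stdev_count
| where stdev_count > 0 AND z_score > 1.5

The idea being that each sender gets compared against its own baseline, so consistently high-volume senders like Gmail wouldn't trip it.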

Thx

1 Solution

aljohnson_splun
Splunk Employee

@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.

That aside, yes, there are lots of ways you could do this. I think a per-sender moving average makes sense. Here's an example that looks for Z-scores above 1.5 for a group of hostnames in some proxy logs:

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time 
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound

You could change the window, span, or group-by fields to get other analyses. I'd also suggest checking out some of the searches in the ML Toolkit, as they have nice examples of using interquartile range or median absolute deviation for similar things.
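
For instance, a rough sketch of the median-absolute-deviation flavor (non-moving here: it baselines each hostname over the whole two weeks rather than a sliding window, so take it as a starting point):

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time
| eventstats median(count) as med by s_hostname
| eval abs_dev = abs(count - med)
| eventstats median(abs_dev) as mad by s_hostname
| where mad > 0 AND abs_dev > 1.5 * mad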

jwalzerpitt
Influencer

And for clarification, I'm trying to do the avg count on a per-sender basis...
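
In your example, I'm guessing that just means swapping the group-by over to the sender field, something like this (sourcetype placeholder is mine, assuming a sender field is extracted):

sourcetype=<your email sourcetype> earliest=-2w
| bin _time span=1d
| stats count by sender, _time
| streamstats window=7 mean(count) as mu, stdev(count) as sigma by sender
| eval upper_bound = mu + (1.5 * sigma)
| where count > upper_bound

That would compare each day's volume against that sender's trailing seven-day average.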

cmerriman
Super Champion

Are you using anomalydetection for the Z-score? Can you post your syntax?
