All Apps and Add-ons

Statistical anomalies searches

jwalzerpitt
Influencer

I am running two different searches (Last 7 days) to determine the statistical anomalies for HTTP POSTS within 1 minute. basically looking for any potential anomalous posts that are making past the WAF. The first search uses ML and is as follows:

index=foo
| bucket _time span=1m
| stats count by _time src
| eventstats avg("count") as avg stdev("count") as stdev by "src"
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| eval isOutlier=if('count' < lowerBound OR 'count' > upperBound, 1, 0) | splitby("src")
| fields _time, src, "count", lowerBound, upperBound, isOutlier
| where isOutlier=1

The second search is looking for Z scores:

index=foo
| bucket _time span=1m
| stats count by _time src website
| eventstats mean("count") AS mean_count, stdev("count") AS stdev_count
| eval Z_score=round(((count-mean_count)/stdev_count),2)
| where Z_score>1.5 OR Z_score<-1.5
| table _time, src, website, count, mean_count, Z_score
| sort -Z_score

Comparing the results of both searches returns different external IPs. Out of a total of 111 distinct IPs, only 12 IPs overlap between the two searches.

The first question I have is are both searches an apple to apple comparison? The second question I have is, is one search more valid than the other, or is running both searches and looking for IPs that overlap a more concise way to evaluate the statistical anomalies for HTTP posts by external IPs?

Thx

0 Karma

skoelpin
SplunkTrust
SplunkTrust

You have the approach mostly correct but I see a few issues.

First, you need to determine what explanatory variables need to be fed into your target function. If your data follows a cyclic type pattern then most likely _time will be your strongest explanatory variable. Assuming your data is cyclic, you will need to establish a baseline over certain time periods then calculate your boundaries.

Next, you're going to want to make this scalable, so running sub-searches is out of the question. A better approach would be to feed the data into a summary index so you have a 1 day baseline in advance and you can then run a 5-10 minute populating search which will overlay that baseline. This can then trigger visual and email alerts anytime your actual values fall out of "normal"

0 Karma

Anam
Community Manager
Community Manager

@jwalzerpitt

My name is Anam Siddique and I am the Community Content Specialist for Splunk Answers. Please accept the answer if the solution provided by @skoelpin worked for you. We have awesome users who contribute and it would be great if the community can benefit from their answer plus they can get credit/points for their work!

Thanks

Career Survey
First 500 qualified respondents will receive a $20 gift card! Tell us about your professional Splunk journey.
Get Updates on the Splunk Community!

Thanks for the Memories! Splunk University, .conf25, and our Community

Thank you to everyone in the Splunk Community who joined us for .conf25, which kicked off with our iconic ...

Data Persistence in the OpenTelemetry Collector

This blog post is part of an ongoing series on OpenTelemetry. What happens if the OpenTelemetry collector ...

Introducing Splunk 10.0: Smarter, Faster, and More Powerful Than Ever

Now On Demand Whether you're managing complex deployments or looking to future-proof your data ...