
Statistical anomalies searches

jwalzerpitt
Influencer

I am running two different searches (Last 7 days) to determine statistical anomalies for HTTP POSTs within a 1-minute window, basically looking for any potentially anomalous POSTs that are making it past the WAF. The first search uses ML and is as follows:

index=foo
| bucket _time span=1m
| stats count by _time src
| eventstats avg("count") as avg stdev("count") as stdev by "src"
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| eval isOutlier=if('count' < lowerBound OR 'count' > upperBound, 1, 0)
| fields _time, src, "count", lowerBound, upperBound, isOutlier
| where isOutlier=1

The second search is looking for Z scores:

index=foo
| bucket _time span=1m
| stats count by _time src website
| eventstats mean("count") AS mean_count, stdev("count") AS stdev_count
| eval Z_score=round(((count-mean_count)/stdev_count),2)
| where Z_score>1.5 OR Z_score<-1.5
| table _time, src, website, count, mean_count, Z_score
| sort -Z_score

Comparing the results of both searches returns different external IPs. Out of a total of 111 distinct IPs, only 12 IPs overlap between the two searches.

The first question I have is: are the two searches an apples-to-apples comparison? The second question is: is one search more valid than the other, or is running both searches and looking for the IPs that overlap a sounder way to evaluate statistical anomalies in HTTP POSTs by external IP?
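
For reference, here is a rough single-search sketch of the overlap idea (the website split from the second search is dropped, the 2-sigma and 1.5 Z-score thresholds are carried over from above, and this is just an illustration, not necessarily the right test):

index=foo
| bucket _time span=1m
| stats count by _time src
| eventstats avg(count) AS avg_src stdev(count) AS stdev_src by src
| eventstats avg(count) AS avg_all stdev(count) AS stdev_all
| eval isOutlier=if(count < avg_src-(2*stdev_src) OR count > avg_src+(2*stdev_src), 1, 0)
| eval Z_score=round((count-avg_all)/stdev_all, 2)
| eval isZOutlier=if(abs(Z_score) > 1.5, 1, 0)
| stats max(isOutlier) AS flagged_stdev max(isZOutlier) AS flagged_zscore by src
| where flagged_stdev=1 AND flagged_zscore=1

The first eventstats keeps the per-src baseline used by the first search, while the second keeps the global baseline used by the Z-score search, so a src only survives if both methods flag it at least once.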

Thx


skoelpin
SplunkTrust

You have the approach mostly correct, but I see a few issues.

First, you need to determine which explanatory variables to feed into your target function. If your data follows a cyclic pattern, then _time will most likely be your strongest explanatory variable. Assuming your data is cyclic, you will need to establish a baseline over the relevant time periods and then calculate your boundaries.
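
As a rough illustration only (assuming the cycle is daily; the 30-day lookback, 1-minute span, and 2-sigma multiplier are placeholders), a baseline keyed on source and hour of day might look something like this:

index=foo earliest=-30d@d latest=@d
| bucket _time span=1m
| stats count by _time src
| eval hour_of_day=strftime(_time, "%H")
| stats avg(count) AS avg_count stdev(count) AS stdev_count by src hour_of_day
| eval lowerBound=avg_count-(2*stdev_count), upperBound=avg_count+(2*stdev_count)

The lowerBound/upperBound per src and hour then define what "normal" looks like for that slice of the cycle.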

Next, you're going to want to make this scalable, so running subsearches is out of the question. A better approach is to feed the data into a summary index so that you have a one-day baseline built in advance; you can then run a populating search every 5-10 minutes that overlays live values on that baseline. This can then trigger visual and email alerts anytime your actual values fall outside of "normal".
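
A minimal sketch of that pattern, assuming a hypothetical summary index named summary_http_posts and keeping the 2-sigma boundary (schedules and windows are placeholders). First, a scheduled search that writes per-minute counts into the summary index, e.g. every 5 minutes:

index=foo earliest=-5m@m latest=@m
| bucket _time span=1m
| stats count AS post_count by _time src
| collect index=summary_http_posts

Then the 5-10 minute populating search reads the summary index, builds the one-day baseline, and overlays the most recent minutes against it:

index=summary_http_posts earliest=-1d@m latest=@m
| eventstats avg(post_count) AS avg_count stdev(post_count) AS stdev_count by src
| eval lowerBound=avg_count-(2*stdev_count), upperBound=avg_count+(2*stdev_count)
| where _time>=relative_time(now(), "-10m@m") AND (post_count>upperBound OR post_count<lowerBound)

Because the baseline comes from the summary index rather than a subsearch, the same pattern keeps working as the raw data volume grows.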


Anam
Community Manager

@jwalzerpitt

My name is Anam Siddique and I am the Community Content Specialist for Splunk Answers. Please accept the answer if the solution provided by @skoelpin worked for you. We have awesome users who contribute, and it would be great if the community could benefit from their answer and they could get credit/points for their work!

Thanks
