Splunk Enterprise Security

Does anybody know the purpose of this formula in a Splunk use case from the content library?

jogonz20
Explorer

Hello splunkers,

While checking some use cases, I found one I am interested in, "Detect Spike in Network ACL activity". My question is about the formula it uses to detect the suspicious activity. Here is the query it is based on:

    sourcetype=aws:cloudtrail `network_acl_events`
        [search sourcetype=aws:cloudtrail `network_acl_events`
        | spath output=arn path=userIdentity.arn
        | stats count as apiCalls by arn
        | inputlookup network_acl_activity_baseline append=t
        | fields - latestCount
        | stats values(*) as * by arn
        | rename apiCalls as latestCount
        | eval newAvgApiCalls=avgApiCalls + (latestCount-avgApiCalls)/720
        | eval newStdevApiCalls=sqrt(((pow(stdevApiCalls, 2)*719 + (latestCount-newAvgApiCalls)*(latestCount-avgApiCalls))/720))
        | eval avgApiCalls=coalesce(newAvgApiCalls, avgApiCalls), stdevApiCalls=coalesce(newStdevApiCalls, stdevApiCalls), numDataPoints=if(isnull(latestCount), numDataPoints, numDataPoints+1)
        | table arn, latestCount, numDataPoints, avgApiCalls, stdevApiCalls
        | outputlookup network_acl_activity_baseline
        | eval dataPointThreshold = 15, deviationThreshold = 3
        | eval isSpike=if((latestCount > avgApiCalls+deviationThreshold*stdevApiCalls) AND numDataPoints > dataPointThreshold, 1, 0)
        | where isSpike=1
        | rename arn as userIdentity.arn
        | table userIdentity.arn]
    | spath output=user userIdentity.arn
    | stats values(eventName) as eventNames, count as numberOfApiCalls, dc(eventName) as uniqueApisCalled by user

I understand pretty much all of it, but I do not understand what this part does:

    | eval newAvgApiCalls=avgApiCalls + (latestCount-avgApiCalls)/720
    | eval newStdevApiCalls=sqrt(((pow(stdevApiCalls, 2)*719 + (latestCount-newAvgApiCalls)*(latestCount-avgApiCalls))/720))

Specifically, where do the 720 and 719 come from? So my question is: has anybody worked with this use case before, or a similar one? I noticed there are others that use the same formula.

I am using Splunk ES version 6.1.1 

Thanks so much,


brandonvu
Loves-to-Learn Lots

If you look at the AWS documentation and the notes found in the references below, they tell you that "Some commands, parameters, and field names in the searches below may need to be adjusted to match your environment." You can also tell that multiple use cases share the same logic with the 719 and 720.

To answer your question, I reverse-engineered the use case, and you can tell that it is simply using the standard algorithm with two deltas over "N". Knowing this, N=720 is the "number of sample results", aka data points, in the dataset being examined. 719 is actually "N-1", which is just Bessel's correction, used to correct the bias in the estimate of the population variance.

So basically, you are supposed to replace the N and the N-1 with the number of data points in your actual search, and not use 719 or 720.
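For example (this is just my own illustration, not part of the original search), you could parameterize the two evals with an N field and fill in whatever sample size your baseline actually uses:

    | eval N=720
    | eval newAvgApiCalls=avgApiCalls + (latestCount-avgApiCalls)/N
    | eval newStdevApiCalls=sqrt((pow(stdevApiCalls, 2)*(N-1) + (latestCount-newAvgApiCalls)*(latestCount-avgApiCalls))/N)

Change N in the first eval to your own number of data points and the N-1 follows along.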

Also note that multiple AWS use cases in the references I listed below use the same logic. This is likely because you can build use cases on top of the same logic and just change your main search and conditions a little bit to meet your criteria. With that said, it is likely that these use cases were tested against the same dataset, hence they all have the 719 and 720 in the logic.

 

AWS References:

Algorithm References:

 

 


Richfez
SplunkTrust

Oooh oooh oooh, I don't *know* the answer but I want to guess really badly.

My guess is that it's trying to keep a 12-hour history of 1-minute data points (or possibly 12 minutes of 1-second ones). I'm going to use the 12 minutes : 1 second ratio below, but EITHER is exactly as valid as the other. I suspect you'll know which one it is by looking to see whether the search is scheduled to run every minute (the short side of the long-period possibility) or every 12 minutes (the long side of the shorter-period possibility).

So why 720 in some places, 719 in others? 

I believe from inspecting it that the idea is perhaps to discount one of the N data points, since it's already counted as the latestCount entry.

We first build a newAvgApiCalls, which is our current stored value for avgApiCalls from the lookup, plus 1/720th of the difference between the latest calculated count and that stored average.

Then we create a new stdDev value by taking the square root of:

  719 copies of the existing standard deviation of API calls squared,
  plus the product of (latestCount minus the new average) and (latestCount minus the old average),
  all divided by 720... but this gets a little squirrely in here.
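
To put rough numbers on it (these values are completely made up, just to see the scale of the effect): if the lookup says avgApiCalls=10 and stdevApiCalls=2, and the subsearch finds latestCount=50, then

    newAvgApiCalls = 10 + (50 - 10)/720 ≈ 10.06
    newStdevApiCalls = sqrt((4*719 + (50 - 10.06)*(50 - 10))/720) ≈ 2.49

So a single big spike barely nudges the stored average but noticeably inflates the stored standard deviation, which I assume is the intent.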

I'm not convinced this is mathematically sound. And I still haven't discovered how the list is trimmed to 720, and if it's not trimmed to 720, then really this all starts getting weirder as time passes and it starts acting like a weighting function on ... the standard deviation. Interesting concept.

If nothing else, I wonder about the units.  It seems like we're doing a lot of semi-circular duplicated logic inside there.

If I substitute a bit, we start with

    | eval newAvgApiCalls=avgApiCalls + (latestCount-avgApiCalls)/720 
    | eval newStdevApiCalls=sqrt(((pow(stdevApiCalls, 2)*719 + (latestCount-newAvgApiCalls)*(latestCount-avgApiCalls))/720)) 

OK, let's make some names shorter and clearer: call avgApiCalls AVG, since it's a core thing we have, and call latestCount LATEST, since it's also ground truth.

newAvgApiCalls will be NEWAVG.

STD and NEWSTD are our remaining, shorter, variables.  Substituting once,

    | eval NEWAVG= AVG + (LATEST - AVG)/720 
    | eval NEWSTD=sqrt(((pow(STD, 2)*719 + (LATEST-NEWAVG)*(LATEST-AVG))/720)) 

And then, substituting once more because NEWAVG is in that second line, let's put in what NEWAVG is (that equation):

    | eval NEWAVG= AVG + (LATEST - AVG)/720 
    | eval NEWSTD=sqrt(((pow(STD, 2)*719 + 
          (LATEST-(AVG + (LATEST - AVG)/720))*(LATEST-AVG))/720)) 

And that just ...

doesn't look right?  Though at least there's now a "/720" on both sides of that * in the last line, so that seems more reasonable for at least the units on that line. What I still don't get is why you have to multiply one side by 719 and divide the other by 720, when my gut tells me maybe it should either be *719 and *1, or it should be *1 and /720 for the two sides. One side must be a rollup already.  hmm
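
For what it's worth, if I push the substitution one more step (and assuming I haven't dropped a term along the way), the left factor collapses and the 719 turns up in both pieces:

    LATEST - NEWAVG = (LATEST - AVG) - (LATEST - AVG)/720 = (LATEST - AVG)*(719/720)
    NEWSTD = sqrt((719*pow(STD, 2) + (719/720)*pow(LATEST - AVG, 2))/720)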

I may dissect this further and examine how it works on some cobbled-up CSV data I can calculate manually, to test it more carefully.
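
In the meantime, here's a rough sketch (same made-up numbers as above, nothing from a real baseline) that you could paste into a search bar to watch the two evals do their thing:

    | makeresults
    | eval latestCount=50, avgApiCalls=10, stdevApiCalls=2
    | eval newAvgApiCalls=avgApiCalls + (latestCount-avgApiCalls)/720
    | eval newStdevApiCalls=sqrt(((pow(stdevApiCalls, 2)*719 + (latestCount-newAvgApiCalls)*(latestCount-avgApiCalls))/720))
    | table latestCount avgApiCalls stdevApiCalls newAvgApiCalls newStdevApiCalls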

Anyway, regardless of my curiosity, which I may or may not have time to dig into much further tomorrow some time, I hope at least my conjecture makes sense to you and fills in enough of the gap that it's a BIT more clear. A bit.

Anyway, happy Splunking!

-Rich
