Solved: Re: How can I run a search that will use data from...

viggor · ‎09-28-2017

Given a timeinterval provided by the user, I would like to output those buckets who contain more elements than the average of the 50 non-empty buckets before a bucket.

Is there an easy way of doing this?

DalJeanis · ‎09-28-2017

We can interpret your question a number of ways. Here's one easy way that illustrates what you want to know.

index=foo earliest=-60m@m latest=@m 
| bin _time span=10s
| rename COMMENT as "The above gives you 360 time buckets at 10s each"

| rename COMMENT as "roll up each bin, and calculate the average for up to 50 bins before it"
| stats count as mycount by _time
| streamstats current=f count as recno avg(mycount) as avgcount window=50

| rename COMMENT as "throw away the first 50 bins and any bins less than average"
| where (recno>=50) AND (mycount>avgcount)

Also, please remember that, by definition, about half of all bins are going to contain more events than average. If you get no results from the above, then change to window=51. (It means I was wrong about whether current=f window=50 meant the last 50 or the last 49 events.)

If you are looking for bins that are significantly above average, then you might want to consider using "average plus some number of standard deviations".

| streamstats current=f count as recno avg(mycount) as avgcount stdev(mycount) as stdevcount window=50

| rename COMMENT as "throw away the first 50 bins and any bins less than average plus 2 stdevs"
| where (recno>=50) AND (mycount>avgcount+2*stdevcount)

View solution in original post

DalJeanis · ‎09-28-2017

We can interpret your question a number of ways. Here's one easy way that illustrates what you want to know.

index=foo earliest=-60m@m latest=@m 
| bin _time span=10s
| rename COMMENT as "The above gives you 360 time buckets at 10s each"

| rename COMMENT as "roll up each bin, and calculate the average for up to 50 bins before it"
| stats count as mycount by _time
| streamstats current=f count as recno avg(mycount) as avgcount window=50

| rename COMMENT as "throw away the first 50 bins and any bins less than average"
| where (recno>=50) AND (mycount>avgcount)

Also, please remember that, by definition, about half of all bins are going to contain more events than average. If you get no results from the above, then change to window=51. (It means I was wrong about whether current=f window=50 meant the last 50 or the last 49 events.)

If you are looking for bins that are significantly above average, then you might want to consider using "average plus some number of standard deviations".

| streamstats current=f count as recno avg(mycount) as avgcount stdev(mycount) as stdevcount window=50

| rename COMMENT as "throw away the first 50 bins and any bins less than average plus 2 stdevs"
| where (recno>=50) AND (mycount>avgcount+2*stdevcount)

viggor · ‎09-28-2017

This is a very nice answer, thank you very much.
I have two questions:
1) In your solution, I don't see a condition on the 50 buckets that requires them to be non-empty.
2) related to this, does earliest=-60m@m limit the timerange from which the 50 non-empty buckets are chosen to the last hour? (I could be that there are fewer than 50 such buckets in the last hour.)

DalJeanis · ‎09-28-2017

@viggor - (1) the stats command will not create records unless there is at least one event in the _time range. If you were using a more complicated criteria, you could drop records immediately after the stats.

(2) As part of interpreting the question, we had to make some arbitrary assumptions. One of them was that 360 10-second buckets would be enough to get at least 50 non-empty buckets.

Mostly we were trying to give you a working base search that you could update using whatever assumptions are closer to the characteristics of your actual data.

Happy Splunking!

viggor · ‎09-29-2017

Awesome, thanks a lot. One last question. The first condition in the where statement, i.e., | where (recno>=50), is this to exclude buckets for which there are less than 50 non-empty buckets preceding it? Put differently, if I knew that every bucket always has at least 50 preceding buckets, could I drop that condition?

viggor · ‎09-29-2017

Also on line 7:

| streamstats current=f count as recno avg(mycount) as avgCount stdev(mycount) as stDevCount window=120

shouldn't the average be taken over count instead of over mycount?

DalJeanis · ‎09-29-2017

@viggor - best practices are to always rename the field count, to avoid confusing yourself or splunk. Thus, count in a command always refers to the count being calculated in that current command, and it is never possible to average it or anything else. In this case, mycount is the count created in line 6 for a particular _time bucket.

If you change streamstats to window=120, then you are averaging the last 120 non-zero buckets. Is that what you want?

The first 50 non-zero buckets can never have 50 non-zero buckets before them to average out. You do not HAVE to drop them... normally I would just start the job with earliest= far enough back that I could throw away the first 50 records and still have the range that I was actually looking for.

For instance, if I typically had about 75 nonzero buckets per hour, and if I wanted a 2 hour window, then I would start the window a little more than 40 minutes earlier than my window. (Probably 60 just to be sure.)

As long as you are doing that, then you don't really need to check for the 50. If you are checking against the average that happens to be only 45 events, the difference is not that critical. Just average the events starting at a duration before the desired range that you estimate should usually contain at least 50 nonzero buckets, and then after calculating averages, throw away all buckets before the desired range.

How can I run a search that will use data from buckets from a specific time interval?

Tech Talk Recap | Mastering Threat Hunting

Observability for AI Applications: Troubleshooting Latency

Splunk AI Assistant for SPL vs. ChatGPT: Which One is Better?

Are you a member of the Splunk Community?

How can I run a search that will use data from buckets from a specific time interval?

Tech Talk Recap | Mastering Threat Hunting

Observability for AI Applications: Troubleshooting Latency

Splunk AI Assistant for SPL vs. ChatGPT: Which One is Better?