Splunk Search

Getting the Number of X Events After the Last Y Event

PaintItParker
Explorer

Right now I have something like this:


index=my_index sourcetype=my_sourcetype
| rex field=message "- (?<User>\S+) -:"
| rex field=message "- (?<MessageInfo>\S+) :"
| eval Err=if(match(MessageInfo, "(Error example 1)|(Error example 2)"), 1, 0)
| eval Succ=if(match(MessageInfo, "(Success example 1)|(Success example 2)"), 1, 0)
| stats sum(Err) as ErrCount, sum(Succ) as SuccCount by User
| table User, ErrCount, SuccCount


So, ErrCount gets the total count of errors for each User. However, I am writing an alert, and we only want to be alerted if there have been 10 or more errors since the last success, within a 4-hour time range.

So basically:

1. By User, look at the last success message that occurred in the 4-hour time range

2. If 10 or more errors occurred since the last success message, set a flag for the User - only Users with a flag set are tabled

3. Table User and the number of errors that occurred since the last success

Is this at all possible? How could I start to go about it? I am lost on how to find the last success message and then use it to count the errors that came after it.


ITWhisperer
SplunkTrust

If you only want to count errors since the last success and aren't concerned with errors between earlier successes, you can determine the time of each user's last success and count only the errors that occurred after it. (For a user with no success in the window, every error counts.)

index=my_index sourcetype=my_sourcetype
| rex field=message "- (?<User>\S+) -:"
| rex field=message "- (?<MessageInfo>\S+) :"
| eval Err=if(match(MessageInfo, "(Error example 1)|(Error example 2)"), 1, 0)
| eval Succ=if(match(MessageInfo, "(Success example 1)|(Success example 2)"), 1, 0)
| eval SuccTime=if(Succ==1,_time,null())
| eventstats max(SuccTime) as SuccTime by User
| eval Err=if(isnull(SuccTime) OR _time>SuccTime,Err,null())
| stats sum(Err) as ErrCount, sum(Succ) as SuccCount by User
| table User, ErrCount, SuccCount
| where ErrCount>9

This doesn't require a sort, so it may be slightly better for large data sets. Run it over the alert's 4-hour window (for example, earliest=-4h latest=now) to match the requirement.

tscroggins
Influencer

We could also do some interesting things with ratios.

Using splunkd logs (my favorite source of sample data):

Global ratio of successes to errors less than 1:9 (if errors is 0, the quotient is null and the where clause drops the component):


index=_internal sourcetype=splunkd source=*/splunkd.log*
| eval error=if(match(log_level, "ERROR"), 1, 0), success=if(NOT match(log_level, "ERROR"), 1, 0)
| stats sum(error) as errors sum(success) as successes by component
| where successes/errors<1/9
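
Applied to the original search, the same global-ratio idea might look something like this (a sketch reusing the Err/Succ flags from the question; note it alerts on the overall ratio in the window, not strictly on errors since the last success):

index=my_index sourcetype=my_sourcetype earliest=-4h latest=now
| rex field=message "- (?<User>\S+) -:"
| rex field=message "- (?<MessageInfo>\S+) :"
| eval Err=if(match(MessageInfo, "(Error example 1)|(Error example 2)"), 1, 0)
| eval Succ=if(match(MessageInfo, "(Success example 1)|(Success example 2)"), 1, 0)
| stats sum(Err) as ErrCount sum(Succ) as SuccCount by User
| where SuccCount/ErrCount<1/9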


Moving ratio of successes to errors less than 1:9:


index=_internal sourcetype=splunkd source=*/splunkd.log*
| streamstats global=f window=10 count(eval(match(log_level, "ERROR"))) as errors count(eval(NOT match(log_level, "ERROR"))) as successes by component
| where successes/errors<1/9


A ratio of weighted moving averages would put more emphasis on the latest events. There's a century of prior art in this space, but prior to the advent of machine learning, I don't think the basics changed much.
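
A single weighted moving average of the error rate is a simpler cousin of that idea. SPL's trendline command computes simple, exponential, and weighted moving averages; a minimal sketch on the same splunkd data (the 10-event period and the 0.9 threshold are illustrative, and trendline has no by clause, so this runs across all components combined):

index=_internal sourcetype=splunkd source=*/splunkd.log*
| sort 0 _time
| eval error=if(match(log_level, "ERROR"), 1, 0)
| trendline wma10(error) as error_wma
| where error_wma>0.9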


tscroggins
Influencer

@PaintItParker 

This may work for you as a framework, although using streamstats to reset counts in the correct chronological order requires sorting events, which isn't optimal over large result sets:

index=my_index sourcetype=my_sourcetype earliest=-4h latest=now
| sort 0 _time
| rex field=message "- (?<User>\S+) -:"
| rex field=message "- (?<MessageInfo>\S+) :"
| eval Err=if(match(MessageInfo, "(Error example 1)|(Error example 2)"), 1, 0)
| eval Succ=if(match(MessageInfo, "(Success example 1)|(Success example 2)"), 1, 0)
| streamstats reset_before="("Succ==1")" time_window=4h count by User
| stats max(_time) as _time latest(MessageInfo) as MessageInfo max(Err) as Err max(Succ) as Succ latest(count) as count by User
| eval ErrCount=if(Succ==1, count-1, count)
| fields - count
| where ErrCount>9

We subtract 1 from count when Succ==1 to exclude the success event from the error count. When no success events have occurred, the count value doesn't require correction.
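
To see why the correction is needed, you can trace the reset behavior on synthetic events (a throwaway sketch using makeresults; the success lands on the third of six events):

| makeresults count=6
| streamstats count as n
| eval _time=_time+n, Succ=if(n==3, 1, 0)
| streamstats reset_before="("Succ==1")" count

The counts come out 1, 2, 1, 2, 3, 4: the success event restarts the run at 1, so the final count of 4 covers one success plus three later events, and subtracting 1 leaves just the events that followed the success.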
