topic efficient way of comparing with historic events using percentile in Splunk Search

efficient way of comparing with historic events using percentile

kundeng — Fri, 13 Jun 2014 22:08:46 GMT

Hi, I have a query that is meant to compare longitudinal count of an event of a given day (e.g. today) with historical longitudinal percentiles.

query 1:

index=interfaces sourcesession="MICHART_SIU_HL7_INBOUND"  [`define_relative_week(1)`] date_wday!=saturday date_wday!=sunday   | bin _time span=20m |  stats  count as count by _time  | eval timeofday=tonumber(strftime(_time, "%H"))*3600 + tonumber(strftime(_time,"%M"))*60  | stats perc5(count) as perc5 perc95(count) as perc95 by timeofday  |  eval timeofday= tostring(timeofday,"duration")  | join type=outer timeofday [search index=interfaces sourcesession="MICHART_SIU_HL7_INBOUND"   | bin _time span=20m  | eval timeofday=tonumber(strftime(_time, "%H"))*3600 + tonumber(strftime(_time,"%M"))*60| eval timeofday= tostring(timeofday,"duration")| stats  count by timeofday] |sort timeofday

where define_relative_week is a macro:

stats count | addinfo | eval earliest=(info_min_time-604800*$n$) | eval earliest=strftime(earliest,"%m/%d/%Y:%H:%M:%S") | eval latest=(info_max_time-86400) | eval latest=strftime(latest,"%m/%d/%Y:%H:%M:%S") | return earliest,latest

Basically, this search grabs historical data (previous week) and bins the time, and then compute percentile by binned time.

Some of the things that I'm grappling with are:

multiple event types. How to generalize this to multiple event types? For example, I have many types of events (e.g. with different sourcesession values). And recomputing the historical percentile for each type of events is going to be too much for computations.
Acceleration. Suppose for now, there is only one event of interest. How can I further accelerate this search? I need to set-up an alarm by comparing current count with 5% and 95% percentiles for the same time period, and the saved alarm/search will run every 15 minutes and inspect the previous 15 minutes to see whether there is any anomaly.
I tried the following query which filters out irrelevant time ranges. But to my surprise it runs slower than the original query. Any insights why?

query 2:

| gentimes  end=-1 increment=24h [stats count | addinfo | eval start=(info_min_time-604800*2) | eval start=strftime(start,"%m/%d/%y:%H:%M:00")| return start]
| rename starttime as earliest 
| eval latest=earliest+20*60
| sort - earliest
| map maxsearches=99999
      search="search earliest=$earliest$ latest=$latest$ index=interfaces sourcesession=MICHART_SIU_HL7_INBOUND  " | ...

Re: efficient way of comparing with historic events using percentile

lguinn2 — Sat, 14 Jun 2014 00:34:10 GMT

Wow - this is complicated. I think you can simplify a lot by using Splunk's built-in time math...

index=interfaces earliest=-7d latest=@d sourcesession="MICHART_SIU_HL7_INBOUND"
| bucket _time span=20m | eval timeOfDay=strftime(_time,"%H:%M")
| stats count by timeOfDay
| start p5(count) as perc5 p95(count) as perc95 by timeOfDay 
| append [ search index=interfaces earliest=@d
     | bucket _time span=20m | eval timeOfDay=strftime(_time,"%H:%M")
     | stats count by timeOfDay  ]
| chart first(perc5) as perc5 first(perc95) as perc95 first(count) as count by timeOfDay

This makes a nice chart. If you want to build an alert - try this

index=interfaces earliest=-7d latest=@d 
| bucket _time span=20m | eval timeOfDay=strftime(_time,"%H:%M")
| stats count by timeOfDay sourcesession
| start p5(count) as perc5 p95(count) as perc95 by timeOfDay sourcesession
| append [ search index=interfaces earliest=@d
     | bucket _time span=20m | eval timeOfDay=strftime(_time,"%H:%M")
     | stats count by timeOfDay sourcesession ]
| stats first(perc5) as perc5 first(perc95) as perc95 first(count) as count by timeOfDay sourcesession
| where count < perc5 OR count > perc95

Set the alert condition for "# results > 0". Note that the alert checks across all source sessions at once. If you leave off the last line, it will make a report as well, but it won't look as nice as the first one.

Oh, I didn't use the macro at all.

Re: efficient way of comparing with historic events using percentile

kundeng — Mon, 16 Jun 2014 16:28:07 GMT

doesn't quite work yet. Besides some errors:
*"stats count by _time sourcesession" should be added before eval.
*typo "start"

There are two intertwining problems in result table. Suppose current time is 10:20, and suppose between 9:00-10:00, there is NO data at all for a sourcesession.
1. The abnormal row will be filtered out, because it's count==NULL. However, if I fill NULL count with 0,
2. then I will include extra rows such as rows for timeofday=11:00, because its count is also NULL.

How do I deal with this?

Re: efficient way of comparing with historic events using percentile

kundeng — Thu, 19 Jun 2014 14:26:28 GMT

So here is the query that is being tested and it needs to be tweaked to further reduce false positives.

index=interfaces earliest=-10d latest=@d    | bucket _time span=10m   | stats count by _time sourcesession  | eval timeOfDay=strftime(_time,"%H:%M")  | stats p5(count) as perc5 p95(count) as perc95 by timeOfDay sourcesession  | append   [ search index=interfaces earliest=-40m latest=-10m       | bucket _time span=10m | eval timeOfDay=strftime(_time,"%H:%M")      | stats count by timeOfDay sourcesession |fillnull ]
 | stats first(perc5) as perc5 first(perc95) as perc95 first(count) as count by timeOfDay sourcesession | eval etime=strptime(timeOfDay,"%H:%M") | where etime>=relative_time(now(),"-40m") AND etime<relative_time(now(),"-10m") | fillnull | where (count<perc5) | fields - etime | sort sourcesession timeOfDay | stats count by sourcesession |search count>=3