<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Parallel stats - most efficient structure in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368083#M163084</link>
    <description>&lt;P&gt;You could do something like this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;|multireport
[|bucket _time span=1min| stats DC(id) as unique_devices_minute BY device_type _time|eval period="minute"]
[|bucket _time span=1h| stats DC(id) as unique_devices_hour BY device_type _time|eval period="hour"]
[|bucket _time span=1d| stats DC(id) as unique_devices_day BY device_type _time|eval period="day"]
| eval unique_devices=CASE(period="minute",unique_devices_minute,period="hour",unique_devices_hour,period="day",unique_devices_day)
| fields - unique_devices_minute, unique_devices_hour, unique_devices_day
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I see that you're doing some sort of summing by minute/hour/day before you eval &lt;CODE&gt;unique_devices&lt;/CODE&gt; in separate appendpipes, but I'm not entirely sure what that's doing in the end. Some sample data and desired output might be more helpful. &lt;/P&gt;</description>
    <pubDate>Wed, 07 Feb 2018 13:11:39 GMT</pubDate>
    <dc:creator>cmerriman</dc:creator>
    <dc:date>2018-02-07T13:11:39Z</dc:date>
    <item>
      <title>Parallel stats - most efficient structure</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368081#M163082</link>
      <description>&lt;P&gt;I frequently have to create stats reports where some parts are, essentially, executable in parallel with others.  An example would be:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;search &amp;lt;something&amp;gt; 
  | appendpipe [ eval _time=FLOOR(_time/60)*60 | stats DC(id) AS unique_devices_minute BY device_type, _time ]
  | appendpipe [ eval _time=FLOOR(_time/3600)*3600 | stats DC(id) AS unique_devices_hour BY device_type, _time ]
  | appendpipe [ eval _time=FLOOR(_time/86400)*86400 | stats DC(id) AS unique_devices_day BY device_type, _time ]
  | eval _time=FLOOR(_time/60)*60
  | stats &amp;lt;&amp;lt;some summaries&amp;gt;&amp;gt; SUM(unique_devices_minute) AS unique_devices_minute, SUM(unique_devices_hour) AS unique_devices_hour, SUM(unique_devices_day) AS unique_devices_day BY device_type, _time
  | eval period="minute"
  | appendpipe [ eval _time=FLOOR(_time/3600)*3600 
    | stats &amp;lt;&amp;lt;sum per-minute to per-hour&amp;gt;&amp;gt; SUM(unique_devices_hour) AS unique_devices_hour, SUM(unique_devices_day) AS unique_devices_day BY device_type, _time
    | eval period="hour" ]
  | appendpipe [ where period="hour" | eval _time=FLOOR(_time/86400)*86400 
    | stats &amp;lt;&amp;lt;sum per-hour to per-day&amp;gt;&amp;gt;  SUM(unique_devices_day) AS unique_devices_day BY device_type, _time
    | eval period="day" ]
  | eval unique_devices=CASE(period="minute",unique_devices_minute,period="hour",unique_devices_hour,eval period="day",unique_devices_day ) | fields - unique_devices_minute, unique_devices_hour, unique_devices_day 
&lt;/CODE&gt;&lt;/PRE&gt;
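
&lt;P&gt;For reference, the FLOOR arithmetic above is plain integer truncation: each epoch timestamp is snapped down to the start of its minute, hour, or day bucket. A quick sketch of the same arithmetic in Python (illustrative only, not SPL):&lt;/P&gt;

```python
# Epoch-time bucketing, equivalent to SPL's FLOOR(_time/span)*span.
def bucket(epoch: int, span: int) -> int:
    """Floor an epoch timestamp to the start of its span-second bucket."""
    return (epoch // span) * span

t = 1518002306  # Wed, 07 Feb 2018 11:18:26 GMT
minute_start = bucket(t, 60)     # start of the containing minute
hour_start = bucket(t, 3600)     # start of the containing hour
day_start = bucket(t, 86400)     # start of the containing day (UTC)
```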

&lt;P&gt;This gives the results I want in a single report, but is it the most efficient way to structure this?&lt;/P&gt;</description>
      <pubDate>Wed, 07 Feb 2018 11:18:26 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368081#M163082</guid>
      <dc:creator>JeToJedno</dc:creator>
      <dc:date>2018-02-07T11:18:26Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel stats - most efficient structure</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368082#M163083</link>
      <description>&lt;P&gt;Before anyone says anything, the first appendpipe (line 2) is unnecessary and can be folded into the stats command on line 6.&lt;/P&gt;

&lt;P&gt;... and there's a spurious "eval" on line 14. &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Feb 2018 11:20:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368082#M163083</guid>
      <dc:creator>JeToJedno</dc:creator>
      <dc:date>2018-02-07T11:20:09Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel stats - most efficient structure</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368083#M163084</link>
      <description>&lt;P&gt;You could do something like this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;|multireport
[|bucket _time span=1min| stats DC(id) as unique_devices_minute BY device_type _time|eval period="minute"]
[|bucket _time span=1h| stats DC(id) as unique_devices_hour BY device_type _time|eval period="hour"]
[|bucket _time span=1d| stats DC(id) as unique_devices_day BY device_type _time|eval period="day"]
| eval unique_devices=CASE(period="minute",unique_devices_minute,period="hour",unique_devices_hour,period="day",unique_devices_day)
| fields - unique_devices_minute, unique_devices_hour, unique_devices_day
&lt;/CODE&gt;&lt;/PRE&gt;
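
&lt;P&gt;In case it helps to see the shape of the output: each bracketed sub-pipeline gets its own copy of the input events, computes a distinct count per device_type and time bucket at one granularity, and tags its rows with a period, and the results are then unioned. A rough Python sketch of that shape, using made-up sample events (illustrative only, not SPL):&lt;/P&gt;

```python
from collections import defaultdict

def distinct_per_bucket(events, span):
    """events: (epoch, device_type, id) tuples. Returns a dict mapping
    (device_type, bucket_start) to the number of distinct ids in that
    bucket, where bucket_start = (epoch // span) * span."""
    seen = defaultdict(set)
    for t, dtype, dev_id in events:
        seen[(dtype, (t // span) * span)].add(dev_id)
    return {key: len(ids) for key, ids in seen.items()}

# Made-up sample events.
events = [(0, "cam", "a"), (30, "cam", "b"), (70, "cam", "a"),
          (3700, "cam", "c"), (30, "dvr", "a")]

# One "sub-report" per granularity, unioned into period-tagged rows.
rows = []
for period, span in (("minute", 60), ("hour", 3600), ("day", 86400)):
    for (dtype, start), n in distinct_per_bucket(events, span).items():
        rows.append({"device_type": dtype, "_time": start,
                     "period": period, "unique_devices": n})
```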

&lt;P&gt;I see that you're doing some sort of summing by minute/hour/day before you eval &lt;CODE&gt;unique_devices&lt;/CODE&gt; in separate appendpipes, but I'm not entirely sure what that's doing in the end. Some sample data and desired output might be more helpful. &lt;/P&gt;</description>
      <pubDate>Wed, 07 Feb 2018 13:11:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368083#M163084</guid>
      <dc:creator>cmerriman</dc:creator>
      <dc:date>2018-02-07T13:11:39Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel stats - most efficient structure</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368084#M163085</link>
      <description>&lt;P&gt;The other stats calculated are always the average, peak &amp;amp; maximum request rates, and sometimes a first-seen-in-day unique-device count.  Peak rate is the max of a 2-second moving-average request rate, or a close approximation (e.g. the 98th percentile of the per-second request rate within each minute, then the max of the per-minute 98th percentiles for the hour &amp;amp; day).&lt;/P&gt;
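
&lt;P&gt;That percentile approximation of the peak is easy to state outside SPL: take the 98th percentile of per-second request counts within each minute, then the max of those per-minute values. A hypothetical Python sketch using a nearest-rank percentile (note that seconds with zero requests are simply absent, which biases the percentile upward slightly):&lt;/P&gt;

```python
from collections import Counter

def p98(values):
    """Nearest-rank 98th percentile of a non-empty list."""
    ranked = sorted(values)
    rank = -(-98 * len(ranked) // 100)  # ceil(0.98 * n)
    return ranked[rank - 1]

def approx_peak_rate(event_times):
    """Approximate peak requests/sec over the given window: 98th percentile
    of per-second counts within each minute, then the max over minutes."""
    per_second = Counter(int(t) for t in event_times)
    by_minute = {}
    for sec, n in per_second.items():
        by_minute.setdefault(sec // 60, []).append(n)
    return max(p98(counts) for counts in by_minute.values())
```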

&lt;P&gt;I was hoping to avoid making multiple independent passes through the whole dataset, but I can see now that's probably not going to be possible, as a distinct count over a different time period will always require a separate pass through the base data. &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Feb 2018 13:36:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Parallel-stats-most-efficient-structure/m-p/368084#M163085</guid>
      <dc:creator>JeToJedno</dc:creator>
      <dc:date>2018-02-07T13:36:48Z</dc:date>
    </item>
  </channel>
</rss>

