Best practice for a sliding window over a large da...

GSnook · ‎08-02-2020

I have a very large, slow data set to query, and I'd like to provide a report covering a sliding window from -21 days ago until 'now' for use on a dashboard. It's far too slow to run a query like this on demand. To work around that, I've added multiple reports:
A) a daily scheduled report to query 21 days of data ending on the start of the current day. Takes 30 minutes to run each night.
B) an hourly scheduled report to query data from the start of the current day until the top of the hour. takes ~5 minutes.
C) a final real-time report that uses `append' and 'loadjob` to combine A and B, plus gathers any new data from the last time B was run.

The dashboard then references report C and sets it as the base search for the dashboard. The end result is 21 days of recent data with minimal wait time for the dashboard user.

It's working, but clunky. The dashboard can assemble the data in about a minute, but report C ends up duplicating all the data from A and B, which is causing storage issues. I'm hoping there is a better way to maintain a sliding window of data that uses a more efficient means to add new events and purge the old ones. Suggestions?

bowesmana · ‎08-02-2020

Are you able to run daily reports that collect summary index results and make your 21 day report uses data from that summarised data. Of course, it will depend on the aggregations you are doing, some are easier to summarise than others, e.g. it's not space efficient to calculate percentiles on high cardinality data over 1 minute, if you are looking to then calculate percentiles over a larger window.

Generally if you can design your final reports to generate from summarised data, then you will naturally be reading significantly reduced data sets. Simple counts and averages can be summarised - choosing your base time interval for summary is important - I used to aggregate 1million rows/minute to 1 minute granularity. You can reconstruct averages if you want to change your time interval in the larger time window reports, but reconstructing percentiles is tricker, as you need to use sistats to retain the data set to calculate on a different time window.

The second approach is to make sure your searches are efficient, aggregating and discarding data as high up the pipeline as possible and avoiding slow/inefficient operations, such as eventstats, transaction, join and sort.

GSnook · ‎08-06-2020

Unfortunately, no. The data being collected is individual test results in the form of a csv file of measurements output by each test run. The hope is for Dashboard users to be able to look at trends between one or more tests over the 21 day window as well as drill in to individual test results.

Inefficient searches are definitely part of the problem. Gathering the full set of data requires two lookups per test result. I can attack that, but I was curious if there was a better way to maintain a 21 day window of data by appending the new and removing the old rather than recreating the entire 21 day set each day.

Best practice for a sliding window over a large data set?

report acceleration

saved search

scheduled search

Can’t make it to .conf25? Join us online!

Leveraging Automated Threat Analysis Across the Splunk Ecosystem

Can’t Make It to Boston? Stream .conf25 and Learn with Haya Husain

Splunk Lantern’s Guide to The Most Popular .conf25 Sessions

Are you a member of the Splunk Community?