Managing Late Arriving Data in Summary Indexing

marlog
Explorer

Does anyone know of best practices around managing Summary Indexes in a consistent way?

  1. Let’s say that some data occasionally arrives late (e.g., a forwarder was down). The scheduled search that populates the summary index will calculate stats without this data. Later on, the data arrives, but the stats in the summary index are already incorrect. There is fill_summary_index.py. However, if I run it with “-dedup true”, it will not recalculate statistics that already exist. If I run it without dedup, it will not replace the existing statistics but add new ones. In other words, I’ll have two records, such as “3/24/17 10:30:00.000 XYZ=5” AND “3/24/17 10:30:00.000 XYZ=10”. This would make it hard to know which entry is the correct one. It would also fill the index with unnecessary data over time. Are there known ways to deal with such a scenario?

  2. What are some best practices around managing this in a consistent way? Occasionally, data can arrive late from different sources without me even knowing about it (e.g., someone stops or restarts a forwarder). If fill_summary_index.py recalculated and replaced records instead of adding them, I could schedule the script to run over the weekend from the beginning of time and correct any gaps. Can I do this somehow?

  3. How do Accelerated Reports deal with late data arrivals? Would they detect it? Or should I manually trigger “Rebuild” for them from time to time? Is there any way to automatically trigger the “Rebuild” so it always runs over the weekend?
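
To make the problem in item 1 concrete, here is a toy Python model of the two behaviours as I described them. It is not the code of fill_summary_index.py; the `backfill` function, the `summary` dictionary, and the values are made up purely for illustration.

```python
from collections import defaultdict


def backfill(summary, bucket, value, dedup):
    """Toy model of a backfill run for one time bucket (hypothetical helper).

    summary maps a bucket timestamp to the list of records stored for it.
    """
    if dedup and summary[bucket]:
        # A record already exists for this bucket, so the run is skipped
        # and the stale value stays in place.
        return
    # Without dedup the new value is appended next to the old one.
    summary[bucket].append(value)


summary = defaultdict(list)
bucket = "3/24/17 10:30:00.000"

backfill(summary, bucket, 5, dedup=False)   # original run, before the late data
backfill(summary, bucket, 10, dedup=True)   # rerun with dedup: skipped
after_dedup = list(summary[bucket])

backfill(summary, bucket, 10, dedup=False)  # rerun without dedup: appended
after_append = list(summary[bucket])

print(after_dedup)
print(after_append)
```

Neither mode leaves the bucket holding only the corrected value, which is the behaviour I am after.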

woodcock
Esteemed Legend

Your best bet (if possible) is to convert to accelerated data models. They faithfully deal with late-arriving events with no upkeep or mitigation required.

marlog
Explorer

Yes, I'll have to investigate Data Models since I have not used them yet. However, looking at the documentation, it seems we can accelerate only those Data Models that are made up of streaming commands (similar to Accelerated Reports). The advantage of summary indexing is that it works for any search (for example, transactions). If I replace my Summary Indexing with Data Models that cannot be accelerated, do you think I would actually see any acceleration?

woodcock
Esteemed Legend

Once you switch to DMs, you can use tstats and you will be FLYING:

https://answers.splunk.com/answers/406962/where-can-i-find-detailed-documentation-for-using.html

DalJeanis
Legend

You could base part of the populating search on the Splunk index time of the events. For instance, if you are summarizing hourly stats, run the search at 8:15 am over items indexed between 7:00 and 8:00. This way, if multiple summary jobs put summary data into the index for the same hour (event time), the "correct" entry is ALL of them, added together. You could merge the records after a couple of days if you wanted to. How long you delay before the initial summarizing depends on your typical use case. For some companies, fifteen minutes after the hour would be fine; others would summarize the hour's data 24 hours later.
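
A minimal Python sketch of the idea, outside Splunk and with made-up events: each summary run picks up only the events indexed during its own window (in Splunk, index time is the `_indextime` field), groups them by event-time hour, and writes a partial count. The `summarize` function and the sample data are hypothetical and only illustrate why the partial records can simply be summed.

```python
from collections import Counter
from datetime import datetime

# Each event is (event_time, index_time). The third one arrives late:
# it happened at 7:50 but was not indexed until 9:05.
events = [
    (datetime(2017, 3, 24, 7, 10), datetime(2017, 3, 24, 7, 11)),
    (datetime(2017, 3, 24, 7, 40), datetime(2017, 3, 24, 7, 41)),
    (datetime(2017, 3, 24, 7, 50), datetime(2017, 3, 24, 9, 5)),
    (datetime(2017, 3, 24, 8, 20), datetime(2017, 3, 24, 8, 21)),
]


def summarize(events, window_start, window_end):
    """Count events indexed in [window_start, window_end), keyed by event-time hour."""
    counts = Counter()
    for event_time, index_time in events:
        if window_start <= index_time < window_end:
            event_hour = event_time.replace(minute=0, second=0, microsecond=0)
            counts[event_hour] += 1
    return counts


# One summary run per index-time hour; each run appends partial records.
summary_records = []
for hour in (7, 8, 9):
    start = datetime(2017, 3, 24, hour)
    end = datetime(2017, 3, 24, hour + 1)
    for event_hour, count in summarize(events, start, end).items():
        summary_records.append((event_hour, count))

# At report time, sum every partial record for the same event-time hour.
totals = Counter()
for event_hour, count in summary_records:
    totals[event_hour] += count

for event_hour in sorted(totals):
    print(event_hour, totals[event_hour])
```

Because every event is indexed in exactly one window, no event is counted twice and the late event still lands in the 7:00 total. This only works for stats that can be combined by adding partial results (counts, sums); something like a distinct count would need a different merge step.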
