
Managing Late Arriving Data in Summary Indexing

marlog
Explorer

Does anyone know of best practices around managing Summary Indexes in a consistent way?

  1. Let’s say that some data occasionally arrives late (e.g., a forwarder was down). The scheduled search that populates the summary index will calculate stats without this data; when the data arrives later, the stats already written to the summary index are incorrect. There is fill_summary_index.py, but if I run it with “-dedup true” it will not re-calculate statistics that already exist, and if I run it without dedup, it will add new records alongside the existing ones rather than replacing them. In other words, I’ll end up with two records, such as “3/24/17 10:30:00.000 XYZ=5” AND “3/24/17 10:30:00.000 XYZ=10”. This makes it hard to know which entry is correct, and it also fills the index with unnecessary data over time. Are there known ways to deal with such a scenario?

  2. What are some best practices for managing this in a consistent way? Occasionally, data can arrive late from different sources without my even knowing about it (e.g., someone stops and restarts a forwarder). If fill_summary_index.py re-calculated and replaced records instead of adding them, I could schedule the script to run over the weekend from the beginning of time and correct anything that might have gapped. Can I do this somehow?

  3. How do Accelerated Reports deal with late data arrivals? Would they detect it, or should I manually trigger “Rebuild” for them from time to time? Is there any way to trigger the “Rebuild” automatically so it always runs over the weekend?
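One search-time workaround for the duplicate-record problem in question 1 is to keep only the most recently indexed summary record per time bucket when reading the summary back. A sketch (the index name `my_summary` and the `search_name` value are placeholders for your own setup):

```
index=my_summary search_name="hourly XYZ stats"
| sort 0 - _indextime
| dedup _time
| timechart span=1h sum(XYZ) AS XYZ
```

`sort 0 - _indextime` orders records newest-indexed first, so `dedup _time` keeps the backfilled record over the stale one. If your summary splits results by another field (host, source, etc.), add that field to the `dedup` clause.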
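For question 2, backfill runs can be scheduled like any other job, e.g. from cron on the search head. A sketch of the invocation, assuming the saved search is named "hourly XYZ stats" in the `search` app (names, time range, and credentials are examples, not prescriptions):

```
splunk cmd python fill_summary_index.py -app search \
    -name "hourly XYZ stats" -et -7d@d -lt @d \
    -j 4 -dedup true -auth admin:changeme
```

Note the limitation the question describes: with `-dedup true` the script only fills buckets that have no summary data at all, so it repairs complete gaps (e.g., a skipped scheduled search) but not buckets that were summarized before late data arrived; there is no built-in replace mode.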


woodcock
Esteemed Legend

Your best bet (if possible) is to convert to accelerated data models. They faithfully deal with late-arriving events with no upkeep or mitigation required.
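For reference, acceleration is enabled per data model either in the UI (Settings > Data models) or in datamodels.conf; a minimal sketch, where the stanza name and retention window are illustrative placeholders:

```
[My_DataModel]
acceleration = true
acceleration.earliest_time = -3mon
```

Splunk then maintains the acceleration summaries on its own schedule, picking up late-arriving events within the earliest_time window without manual rebuilds.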

marlog
Explorer

Yes, I'll have to investigate Data Models since I have not used them yet. However, looking at the documentation, it seems we can accelerate only those Data Models that are made up entirely of streaming commands (similar to Accelerated Reports). The advantage of summary indexing is that it works for any search (for example, transactions). If I replace my summary indexing with models that cannot be accelerated, do you think I would actually see any speedup?


woodcock
Esteemed Legend

Once you switch to DMs, you can use tstats and you will be FLYING:

https://answers.splunk.com/answers/406962/where-can-i-find-detailed-documentation-for-using.html
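For example, a search over an accelerated model can be written along these lines (the `Web` datamodel and its field names are illustrative, not from this thread):

```
| tstats count AS hits
    FROM datamodel=Web
    WHERE Web.status>=500
    BY _time span=1h
```

Because tstats reads the pre-built acceleration summaries instead of scanning raw events, it typically returns in a fraction of the time of the equivalent `stats` search.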


DalJeanis
Legend

You could make the populating search select events by their Splunk index time rather than their event time. For instance, if you are summarizing hourly stats, a run at 8:15 am would summarize the items indexed between 7:00 and 8:00. That way, if multiple summary jobs put summary data into the index for the same event-time hour, the "correct" entry is ALL of them, and you could merge the records after a couple of days if you wanted to. How long you delay the initial summarizing depends on your typical use case: for some companies, fifteen minutes after the hour would be fine; others would be summarizing the hour's data 24 hours later.
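A sketch of such a populating search, scheduled at 15 minutes past each hour (the index and field names are placeholders; `_index_earliest`/`_index_latest` are the standard index-time search modifiers):

```
index=web _index_earliest=-75m@m _index_latest=-15m@m
| bin _time span=1h
| stats count AS XYZ BY _time
```

Each event is summarized exactly once, based on when it was indexed, so late arrivals simply produce an additional partial record for their event-time hour; the reading search then sums all records per hour rather than deduplicating them.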
