Does anyone know of best practices around managing Summary Indexes in a consistent way?
Let’s say that some data occasionally arrives late (e.g. a forwarder was down). The scheduled search that populates the summary index will calculate stats without this data. Later on, the data arrives, but the stats in the summary index are already incorrect. There is fill_summary_index.py; however, if I run it with “-dedup true” it will not re-calculate statistics that already exist, and if I run it without dedup it will add new records alongside the existing ones rather than replacing them. In other words, I’ll end up with two records, such as “3/24/17 10:30:00.000 XYZ=5” and “3/24/17 10:30:00.000 XYZ=10”. This makes it hard to know which entry is the correct one, and it also fills the index with unnecessary data over time. Are there known ways to deal with such a scenario?
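One way to tolerate the duplicates at search time (a sketch, assuming a hypothetical summary index `my_summary` and source name `My Summary Search`) is to keep only the most recently written summary record per event-time bucket, using the record’s own `_indextime`:

```
index=my_summary source="My Summary Search"
| eventstats max(_indextime) AS latest_run BY _time
| where _indextime = latest_run
```

This treats the latest backfill run as authoritative for each `_time` bucket, though it does not remove the stale rows from the index itself.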
What are some best practices for managing this in a consistent way? Occasionally, data can arrive late from different sources without me even knowing about it (e.g. someone stops/restarts a forwarder). So if fill_summary_index.py re-calculated and replaced records instead of adding them, I could schedule the script to run over the weekend, from the beginning of time, and correct any gaps. Can I do this somehow?
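fill_summary_index.py itself cannot replace records, but a common workaround (sketched here, with hypothetical index/search names and credentials) is to first delete the stale summary rows for the affected window, then backfill that same window again. The `delete` command requires the can_delete role and is irreversible, so scope the time window carefully:

```
# 1) In Splunk search, remove the stale summary rows for the gap window:
#    index=my_summary source="My Summary Search" earliest=-7d@d latest=@d | delete

# 2) Re-run the backfill for the same window from the CLI:
splunk cmd python fill_summary_index.py -app search \
    -name "My Summary Search" -et -7d@d -lt @d -j 4 -auth admin:changeme
```

Scheduled via cron for the weekend, this approximates the “re-calculate and replace” behavior you’re asking about, at the cost of re-running the populating search over the whole window.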
How do Accelerated Reports deal with late-arriving data? Would they detect it? Or should I manually trigger “Rebuild” for them from time to time? Is there any way to trigger the “Rebuild” automatically so it always runs over the weekend?
Yes, I'll have to investigate Data Models since I have not used them yet. However, looking at the documentation, it seems we can accelerate only those Data Models that are made up of streaming commands (similar to Accelerated Reports). The advantage of summary indexing is that it works for any search (for example, transactions). Do you think that if I replace my summary indexing with Data Models that cannot be accelerated, I would actually see any acceleration?
You could base part of the populating search on the Splunk index time of the events. For instance, if you are summarizing hourly stats, run at 8:15 am against items indexed between 7:00 and 8:00. This way, if multiple summary jobs put summary data into the index for the same hour (event time), the "correct" entry is ALL of them. You could merge the records after a couple of days if you wanted to. How long you delay the initial summarizing depends on your typical use case: for some companies, fifteen minutes after the hour would be fine; others would summarize the hour's data 24 hours later.
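The approach above can be sketched as a populating search that selects events by index time rather than event time (index, sourcetype, and summary index names here are hypothetical). Run at 8:15, the `_index_earliest`/`_index_latest` modifiers below select events indexed between 7:00 and 8:00, regardless of how late the events themselves arrived:

```
index=web sourcetype=access_combined
    _index_earliest=-1h@h _index_latest=@h
| bin _time span=1h
| stats count AS hits BY _time, host
| collect index=my_summary
```

Because every run only adds the late arrivals it saw, consumers of the summary should aggregate across all entries for a bucket, e.g. `index=my_summary | stats sum(hits) AS hits BY _time, host`, rather than expecting one record per hour.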