Knowledge Management

Best Practice for Updating Summary Indexed Data

David
Splunk Employee

I'd like to see if there's a "right" way to solve this problem. A lot of my data arrives late, and it gets summary indexed on an hourly basis. Most data gets into the system between 30 and 90 minutes late, and some of it gets into the system up to 48 hours late. The volume of data is such that I need hourly and daily summary indexing to keep search times reasonable.

What I've been doing thus far is running my hourly search at the half hour, with the time window of et=-3h@h lt=-2h@h. Then I run a second search after midnight that regenerates hourly data for et=-2d@d lt=@h. All of my searches tolerate the resulting duplicate entries by running | stats first(myvar) by _time.
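To make the duplicate handling concrete, here is a minimal Python sketch of what keeping one value per time bucket amounts to. It is only an illustration of the idea behind | stats first(myvar) by _time, not Splunk's implementation: the row layout and the function name are assumptions, and "first" here simply means first in input order, which says nothing about which copy Splunk would return first.

```python
def first_by_time(rows):
    """Keep the first myvar seen for each _time bucket (input order decides)."""
    kept = {}
    for row in rows:
        # setdefault only stores a value the first time a bucket is seen,
        # so later duplicates of the same hour are ignored.
        kept.setdefault(row["_time"], row["myvar"])
    return kept


# Hypothetical summary rows: the 10:00 hour was summarized twice.
rows = [
    {"_time": "2021-01-01T10:00", "myvar": 42},
    {"_time": "2021-01-01T11:00", "myvar": 17},
    {"_time": "2021-01-01T10:00", "myvar": 40},
]
deduped = first_by_time(rows)
```

Whether the first copy is the most complete one depends on ordering, which is part of why this approach feels kludgy.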

This works acceptably, but is a bit kludgy. Apart from getting the data in real time (if wishes were horses), is there a better way to approach this? (This question is related to, but different from, another question of mine.)


ftk
Motivator

How about using fill_summary_index.py to backfill the missing data? Take a look at the following:
Manage summary index gaps and overlaps
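As a rough sketch of what such a backfill call could look like, the Python below only builds the command line and does not run it. The saved search name, app, and install path are hypothetical placeholders. The flags (-app, -name, -et, -lt, -dedup) are the ones I recall from the Splunk documentation for the script, so check them against your version before relying on this.

```python
import os


def backfill_command(saved_search, app="search", earliest="-2d@d",
                     latest="@h", splunk_home="/opt/splunk"):
    """Build (but do not run) a fill_summary_index.py command line.

    Assumptions: splunk_home is a placeholder path, and the command is meant
    to be run with $SPLUNK_HOME/bin as the working directory. No -auth flag
    is passed here, so the script would ask for credentials.
    """
    return [
        os.path.join(splunk_home, "bin", "splunk"), "cmd", "python",
        "fill_summary_index.py",
        "-app", app,
        "-name", saved_search,
        "-et", earliest,
        "-lt", latest,
        "-dedup", "true",  # skip time ranges that already have summary data
    ]


# "hourly_summary" is a hypothetical saved search name.
cmd = backfill_command("hourly_summary")
```

The list could be handed to subprocess.run(cmd, cwd=...) from a scheduled job if that turns out to fit.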

0 Karma

ftk
Motivator

Interesting. Have you considered filing an enhancement request?

0 Karma

David
Splunk Employee

Late reply: Possibly. What I really want is a forced overwrite. With -dedup true, the script will not run indexing for periods that are already indexed. One way to improve my method would be to forcibly re-run indexing for periods that have already been indexed, and overwrite the data already present. But based on my understanding of bucketing, I don't think that Splunk has any such functionality.

0 Karma

ftk
Motivator

What about using the script with -dedup true?

0 Karma

David
Splunk Employee

Yeah. A lot of why I'm uncomfortable with the method I have is that it results in at least a triplication of the data, and I'm thinking of widening the hourly search to et=-3h@h lt=now, which would potentially increase the size of the data sixfold.
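For anyone checking the triplication and sixfold figures, here is a small Python simulation of the schedule. It is a sketch under stated assumptions: the nightly job is taken to run in the first hour after midnight (so lt=@h lands on midnight), the partial current hour in the widened window counts as one copy, and the function names are made up for illustration.

```python
from collections import Counter


def copies_per_hour(hourly_offsets, nightly_days_back=2, total_days=10):
    """Count how many summary copies each hour bucket receives.

    hourly_offsets: hour buckets, relative to the run hour, that the hourly
        job covers ([3] models et=-3h@h lt=-2h@h; [0, 1, 2, 3] models
        et=-3h@h lt=now).
    nightly_days_back: whole days the after-midnight job reaches back
        (2 models et=-2d@d lt=@h).
    """
    copies = Counter()
    for run_hour in range(total_days * 24):
        # The hourly job runs every hour.
        for offset in hourly_offsets:
            copies[run_hour - offset] += 1
        # The nightly job runs once per day, just after midnight.
        if run_hour % 24 == 0:
            for bucket in range(run_hour - 24 * nightly_days_back, run_hour):
                copies[bucket] += 1
    return copies


def steady_state(hourly_offsets):
    # Sample an hour in the middle of the simulation to avoid edge effects.
    return copies_per_hour(hourly_offsets)[5 * 24 + 7]


current = steady_state([3])           # one hourly copy + two nightly copies
widened = steady_state([0, 1, 2, 3])  # four hourly copies + two nightly copies
```

Under these assumptions the current schedule writes each hour 3 times and the widened one writes it 6 times.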

This data is pretty tiny (raw: 1.6 MB per day) so it's not really a problem, but we're looking at expanding this out to where we could get a few GB per day, and at that point it would be more problematic. Add to that, tossing old data into a new bucket can have performance implications, etc.

It just seems that there should be a better way =D

0 Karma

ftk
Motivator

True, true. I figured why not put up a cron job with that script and let it handle the gaps rather than the searches. Not sure if that'll yield any performance benefits, though.

0 Karma

David
Splunk Employee

I think that's functionally similar to having the summary index queries re-run over the same data, just slightly more manual (or at least outside the control of Splunk). But I definitely use that periodically, when I need to backfill a larger amount of data.

0 Karma