I'd like to see if there's a "right" way to solve this problem. I've got a lot of late-arriving data that gets summary indexed on an hourly basis. Most of it gets into the system between 30 and 90 minutes late, and some of it arrives up to 48 hours late. The volume of data is such that I need hourly and daily summary indexing to keep searches reasonable.
What I've been doing thus far is running my hourly search at the half hour, with a time window of et=-3h@h lt=-2h@h. Then I run a second search after midnight that generates hourly data for et=-2d@d lt=@h. All of my searches tolerate the resulting duplicate entries by running | stats first(myvar) by _time.
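Roughly, in savedsearches.conf terms, the current setup looks like the sketch below (the stanza names, source index/sourcetype, and the myvar field are placeholders for illustration):

    # hourly pass: runs at half past, summarizing the hour that ended two hours ago
    [hourly summary - myvar]
    enableSched = 1
    cron_schedule = 30 * * * *
    dispatch.earliest_time = -3h@h
    dispatch.latest_time = -2h@h
    action.summary_index = 1
    action.summary_index._name = summary
    search = index=main sourcetype=my_data | bin _time span=1h | stats count as myvar by _time

    # nightly catch-up pass: after midnight, re-summarizes the last two days to pick up late arrivals
    [hourly summary catchup - myvar]
    enableSched = 1
    cron_schedule = 15 0 * * *
    dispatch.earliest_time = -2d@d
    dispatch.latest_time = @h
    action.summary_index = 1
    action.summary_index._name = summary
    search = index=main sourcetype=my_data | bin _time span=1h | stats count as myvar by _time

    # reporting searches then collapse the resulting duplicates, e.g.:
    # index=summary search_name="hourly summary*" | stats first(myvar) as myvar by _time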
This works acceptably, but is a bit kludgy. Apart from getting the data in real-time (if wishes were horses), is there a better way to approach this? (This question is related to, but different from another question of mine.)
How about using fill_summary_index.py to backfill the missing data? Take a look at the following:
Manage summary index gaps and overlaps
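For example, run from $SPLUNK_HOME/bin (the app, saved search name, and credentials below are placeholders), the script dispatches your scheduled search once per scheduled interval in the given window and writes the results into its summary index:

    # backfill the last 2 days of hourly summaries for a saved search
    ./splunk cmd python fill_summary_index.py -app search \
        -name "hourly summary - myvar" -et -2d@d -lt @h -j 4 -auth admin:changeme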
Interesting. Have you considered filing an enhancement request?
Late reply: Possibly. What I really want is a forced overwrite. dedup=true will not run indexing for periods that have already been indexed. One way to improve my method would be to forcefully run indexing for periods that have already been indexed and overwrite the data already present, but based on my understanding of bucketing, I don't think Splunk has any such functionality.
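(A rough approximation, not a true overwrite, would be to mark the stale summary events as deleted and then backfill that window. The index and search_name values below are placeholders; | delete requires the can_delete capability, and it only hides events rather than freeing the space.)

    index=summary search_name="hourly summary - myvar" earliest=-2d@d latest=@h | delete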
What about using the script with -dedup=true?
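Something like this, I'd think (same placeholder names as above); with -dedup true the script checks each interval first and only dispatches the search for hours that have no summary data yet:

    ./splunk cmd python fill_summary_index.py -app search \
        -name "hourly summary - myvar" -et -2d@d -lt @h -dedup true -auth admin:changeme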
Yeah. A lot of why I'm uncomfortable with my current method is that it results in at least a triplication of the data, and I'm thinking of widening the hourly search to et=-3h@h lt=now, which could increase the size of the data as much as sixfold.
This data is pretty tiny (raw: 1.6 MB per day), so it's not really a problem, but we're looking at expanding this to the point where we could get a few GB per day, and at that point it would be more problematic. On top of that, tossing old data into a new bucket can have performance implications, etc.
It just seems that there should be a better way =D
True, true. I figured: why not put up a cron job with that script and let it handle the gaps, rather than the searches? Not sure if that'll yield any performance benefits, though.
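Something along these lines, with the paths, schedule, and credentials as placeholders (cron won't have the Splunk environment, hence the full paths and the small wrapper script):

    #!/bin/sh
    # /opt/splunk/bin/backfill_hourly.sh : placeholder wrapper around the backfill script
    /opt/splunk/bin/splunk cmd python /opt/splunk/bin/fill_summary_index.py \
        -app search -name "hourly summary - myvar" -et -2d@d -lt @h \
        -dedup true -auth admin:changeme

    # crontab entry (single line): run the wrapper every night at 01:15 and log the output
    15 1 * * * /opt/splunk/bin/backfill_hourly.sh >> /var/log/splunk_backfill.log 2>&1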
I think that's functionally similar to having the summary index queries re-run the same information, just slightly more manual (or at least outside of Splunk's control). But I definitely do use that script periodically, when I need to backfill a larger amount of data.