
Dedup function is not working in "fill_summary_index.py" script

Rahul_a
Observer

I need to backfill some missing data into a summary index. Some data is already present in that index, so I only want to backfill the missing events; the data that is already there should not be injected again. I am currently using the 'fill_summary_index.py' script, but in testing it injects duplicate data, which suggests that the deduplication function in this script is not working correctly. Please help me by providing a proper script to address this issue.


Richfez
SplunkTrust

Have you tried the -dedup option for fill_summary_index.py?

Run your fill_summary_index.py script with '-h', like this:

$ splunk cmd python fill_summary_index.py -h

There are all sorts of options in there, including dedup and timeframe changes. It might be useful to spend a few minutes reading that carefully.

You may also find it useful to review the fine docs on this:

https://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Managesummaryindexgapsandoverlaps

 

Happy Splunking! 


Rahul_a
Observer

Hi @Richfez 

Yes, I have tried using the '-dedup' option with the value set to 'true' in the fill_summary_index.py script.

I've been using the following command for the fill_summary_index.py script:
./splunk cmd python fill_summary_index.py -app search -name "test report" -et -24h@h -lt now -index raindex -dedup true -auth admin:password

I carefully reviewed the documentation and the script before testing, but I couldn't find a solution. If there are any specific parameters or configurations that I might be missing, please guide me on how to use them effectively for preventing duplicate data injection.

Your assistance is much appreciated.


Richfez
SplunkTrust

Ah - sometimes the easy answer is the answer, but sometimes it's not!

So, from what I can see of fill_summary_index.py, the dedup option isn't actually magic. That means there's no reason you can't make a few minor modifications (mostly to timeframes) and backfill the summary index manually.

Indeed, there's no magic here anyway.  If fill_summary_index.py is not filling in your blank areas in the summary index correctly using the saved search from the "regular" collector for the summary index, then it seems to me that it's likely that the main search simply isn't working right anyway.

The reasoning here is that when it runs 'normally', it's running over a time period and dumping its output to that summary index. This is exactly what the backfilling version does, with the only difference being that it sets a different start/end time. Again, no magic, just searches running over time periods.
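
To make "searches running over time periods" concrete, here's a rough Python sketch. It is not code taken from fill_summary_index.py; the function name `scheduled_windows` and the hourly period are assumptions made up for the example. It just lists the windows a backfill would have to run the same search over.

```python
from datetime import datetime, timedelta

def scheduled_windows(earliest, latest, period=timedelta(hours=1)):
    """Return (start, end) pairs covering [earliest, latest) in steps of `period`.

    Hypothetical helper: the hourly period is an assumption, not read from
    any saved search schedule.
    """
    windows = []
    start = earliest
    while start + period <= latest:
        windows.append((start, start + period))
        start += period
    return windows

# Example: list the hourly windows in a 3-hour backfill range.
for start, end in scheduled_windows(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 3)):
    print(start.isoformat(), "->", end.isoformat())
```

Each window is one run of the same search with a different earliest/latest time, which is all the backfill amounts to.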

So, a couple of ways forward.

1) You could provide the search and maybe we can spot why it doesn't work right for backfilling.

2) You could craft up a "deduplication search" that you can pass to the backfill function to tell it *how* to identify which periods need backfilling.  (I don't know how to do this, but the notes for the backfill function say you can do this, so I believe it.  And of course, just because I don't know how to do it right now doesn't mean we can't help figure it out, or someone else might!)

3) Or maybe you can just manually run the search that would do the backfilling, only manually selecting the timeframes so that you don't get duplication.  I mean, I'd guess it's just a standard saved search that ends up with `| collect...` at the end.  🙂
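
For option 3, one way to pick the timeframes is to work out which windows are already in the summary index and only run the search for the rest. Here's a minimal Python sketch of that idea, assuming hourly windows and assuming you have already pulled the start times of the windows present in the summary index yourself. The name `missing_ranges` and the sample data are hypothetical, and nothing here talks to Splunk.

```python
from datetime import datetime, timedelta

def missing_ranges(earliest, latest, covered_starts, period=timedelta(hours=1)):
    """Return merged (start, end) ranges in [earliest, latest) that have no summary data.

    covered_starts: start times of windows already present in the summary
    index (assumed to be collected separately, e.g. by searching the index).
    """
    covered = set(covered_starts)
    ranges = []
    start = earliest
    while start + period <= latest:
        end = start + period
        if start not in covered:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous gap: extend it instead of adding a new one.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
        start = end
    return ranges

# Made-up example: hours 0, 1 and 4 of the day are already summarized.
day = datetime(2024, 1, 1)
already_there = [day + timedelta(hours=h) for h in (0, 1, 4)]
for gap_start, gap_end in missing_ranges(day, day + timedelta(hours=6), already_there):
    print("backfill", gap_start.isoformat(), "to", gap_end.isoformat())
```

Each range it returns could then be used as the earliest/latest time when you run the saved search by hand, so the windows that already have data are never touched.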

Anyway, I do hope this helps, and maybe this bump will get someone else who does this a lot to chime in - we'll see!

 
