Getting Data In

index entire file as a single event but avoid duplicate indexing

ishaanshekhar
Communicator

I need to monitor a folder where each file should be treated as a single event.

The files get their entire content over some time (usually hours).

Initially, multiple fragmented events were created for the same file, because the file kept getting modified over time.

To avoid that, I set the file checksum check to entire_md5 (CHECK_METHOD = entire_md5 in props.conf). This avoids the fragmented events by combining them into one single event for the entire file. That is good; however, I now see duplicate events with the same content (the entire file).
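For reference, a minimal sketch of that setting in props.conf (the source path and stanza here are assumptions, not from my actual config):

```ini
# props.conf -- sketch, assuming the monitored folder is /data/reports
[source::/data/reports/*]
# Checksum the whole file rather than just its head, so a growing file
# is treated as one event instead of incremental fragments.
CHECK_METHOD = entire_md5
```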

Could you please help me figure out how to avoid these duplicate events? Maybe there is a way Splunk can automatically delete the duplicate events and retain only one event per file?

Thanks
Ishaan

1 Solution

martin_mueller
SplunkTrust
SplunkTrust

Move the files to a different place, or mark them with an extension like .complete, when you're done writing them. Have Splunk only monitor that different place or whitelist only that extension.

You could fix the duplicates at search time, something like | dedup source, but don't expect blazing performance.
Automatically deleting events is really bad mojo.
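A sketch of the whitelist approach in inputs.conf (the paths, the sourcetype, and the .complete renaming convention are all illustrative assumptions):

```ini
# inputs.conf -- sketch, assuming writers rename a file to *.complete
# only once they have finished writing it
[monitor:///data/reports]
# Pick up only files explicitly marked as finished; in-progress files
# in the same folder are ignored entirely.
whitelist = \.complete$
sourcetype = report_file
```

With this in place, Splunk never reads a half-written file, so no fragments or duplicates are indexed in the first place.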


martin_mueller
SplunkTrust
SplunkTrust

Modifying the cronjob/whatever that moves the files there is by far the best approach.
In general, it's better to avoid a mess than to make a mess and then make another mess trying to clean up the first one.

Deleting duplicates automatically is entirely un-splunky. Running frequent calls to the | delete command is going to severely mess up your buckets, and doesn't reclaim space. Ways to reclaim space (age retention, space restrictions, cleaning entire index) all aren't selective enough for your case.
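The cronjob change above can be sketched as a shell step: write (or stage) the file somewhere Splunk does not watch, then move it into the monitored folder only once it is complete. All paths and filenames below are illustrative.

```shell
#!/bin/sh
# Sketch: stage the file outside the monitored folder, move it in when done.
STAGING=/tmp/staging_demo
MONITORED=/tmp/monitored_demo
mkdir -p "$STAGING" "$MONITORED"

# Simulate the slow writer appending to the file over time.
printf 'part one\n' >> "$STAGING/report.log"
printf 'part two\n' >> "$STAGING/report.log"

# Only after the final write: a rename within the same filesystem is atomic,
# so the monitored folder never contains a half-written file.
mv "$STAGING/report.log" "$MONITORED/report.log"
```

Whether you move the finished file or rename it to a whitelisted extension, the point is the same: Splunk only ever sees the file in its final state.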


ishaanshekhar
Communicator

Thanks @martin_mueller!

I can't go for the .complete suggestion for a few reasons.

Your second suggestion is nice, but the search may span longer periods, such as a few months, and that makes the searches too slow to load.

How about deleting duplicate events automatically? You mentioned it is not a good option, but is there any other option in my case?

Thanks
Ishaan
