index entire file as a single event but avoid duplicate indexing

ishaanshekhar
Communicator

I need to monitor a folder where each file should be treated as a single event.

The files get their entire content over some time (usually hours).

Initially, Splunk created several separate, partial events for the same file as the file was modified over time.

To avoid that, I set the checksum check method (CHECK_METHOD in props.conf) to "entire_md5". This avoids the partial events by combining them into one single event for the entire file. That is good; however, I now see duplicate events with the same content (the entire file).

Could you please help me figure out how to avoid these duplicate events? Maybe there is a way for Splunk to automatically delete the duplicates and retain only one event per file?

Thanks
Ishaan

martin_mueller
SplunkTrust

Move the files to a different place, or mark them with an extension like .complete, when you're done writing them. Have Splunk only monitor that different place or whitelist only that extension.

You could fix the duplicates at search time, something like | dedup source, but don't expect blazing performance.
Automatically deleting events is really bad mojo.
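
As a rough illustration of the "move or mark when done" idea, here is a minimal Python sketch. It assumes the process that writes the files can be changed to write into a staging directory first; the function name `publish_when_complete` and the directory layout are hypothetical, not anything Splunk provides. Splunk would then monitor only the destination directory, or whitelist only the .complete extension.

```python
import os
from pathlib import Path


def publish_when_complete(staging_path: Path, monitored_dir: Path) -> Path:
    """Move a finished file into the monitored directory with a .complete suffix.

    Hypothetical helper: call it only once the writer has finished the file.
    """
    monitored_dir.mkdir(parents=True, exist_ok=True)
    final_path = monitored_dir / (staging_path.name + ".complete")
    # os.replace is a single rename when both paths are on the same filesystem,
    # so the monitored directory never contains a half-written file.
    os.replace(staging_path, final_path)
    return final_path
```

The staging directory and the monitored directory should sit on the same filesystem; across filesystems `os.replace` raises an error rather than copying.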

martin_mueller
SplunkTrust

Modifying the cronjob/whatever that moves the files there is by far the best approach.
In general, it's better to avoid a mess than to make one and then make another trying to clean up the first.

Deleting duplicates automatically is entirely un-splunky. Running frequent calls to the | delete command is going to severely mess up your buckets, and it doesn't reclaim disk space. The ways to reclaim space (age-based retention, size restrictions, cleaning an entire index) are not selective enough for your case.
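
If the writer itself can't be changed, the job that moves the files could instead wait until a file has stopped changing. The sketch below is one hedged way to do that in Python: it moves only files whose modification time is older than a quiet period. The function name `move_quiet_files`, the directory names, and the one-hour default are all assumptions for illustration; pick a quiet period longer than the longest pause you ever see between writes.

```python
import os
import time
from pathlib import Path


def move_quiet_files(incoming_dir: Path, monitored_dir: Path,
                     quiet_seconds: float = 3600.0, now: float = None) -> list:
    """Move files that have not been modified for at least quiet_seconds.

    Hypothetical helper for a cron-style job; Splunk monitors monitored_dir only.
    """
    now = time.time() if now is None else now
    monitored_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(incoming_dir.iterdir()):
        if not entry.is_file():
            continue
        # Skip files that were written to recently; they may still be growing.
        if now - entry.stat().st_mtime < quiet_seconds:
            continue
        target = monitored_dir / entry.name
        # Same-filesystem rename, so Splunk only ever sees the whole file.
        os.replace(entry, target)
        moved.append(target)
    return moved
```

A file that pauses for longer than the quiet period and is then written again would still be moved too early, so this is a heuristic and not a guarantee.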

ishaanshekhar
Communicator

Thanks @martin_mueller!

I can't go for the .complete suggestion for a few reasons.

Your second suggestion is nice, but the searches may cover long periods, such as a few months, and that makes them too slow to load.

How about deleting duplicate events automatically? You mentioned it is not a good option but is there any other option in my case?

Thanks
Ishaan
