Getting Data In

index entire file as a single event but avoid duplicate indexing

ishaanshekhar
Communicator

I need to monitor a folder where each file should be treated as a single event.

The files are written gradually and only reach their full content after some time (usually hours).

Initially, several loosely segregated events would get created for the same file as it was modified over time.

To avoid that, I applied the checksum configuration setting CHECK_METHOD = entire_md5. This avoids the loosely segregated events by combining them into one single event for the entire file. That is good; however, I now see duplicate events with the same content (the entire file).
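For context, here is a minimal props.conf sketch of that setting; the monitored path is an example, not the actual path from my environment:

```ini
# props.conf -- sketch, assuming the monitored folder is /data/reports
[source::/data/reports/*]
CHECK_METHOD = entire_md5
```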

Could you please help me figure out how to avoid these duplicate events? Maybe there is a way for Splunk to automatically delete the duplicates and retain only one event per file?

Thanks
Ishaan

1 Solution

martin_mueller
SplunkTrust

Move the files to a different place, or mark them with an extension such as .complete, once you're done writing them. Then have Splunk monitor only that other location, or whitelist only that extension.
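A minimal inputs.conf sketch of the whitelist approach, assuming an example path and sourcetype (neither is from the thread):

```ini
# inputs.conf -- sketch; the producing job renames each file to *.complete
# only after it has finished writing it, so Splunk never reads a partial file
[monitor:///data/reports]
whitelist = \.complete$
sourcetype = my_report
```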

You could fix the duplicates at search time with something like | dedup source, but don't expect blazing performance.
Automatically deleting events is really bad mojo.
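The search-time workaround could look something like this, where the index and sourcetype names are examples rather than anything from the thread:

```spl
index=my_index sourcetype=my_report
| dedup source
```

Since each file is one event, deduplicating on source keeps only one event per file, but the dedup runs over every matching event at search time.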


martin_mueller
SplunkTrust

Modifying the cronjob/whatever that moves the files there is by far the best approach.
In general, it's better to avoid making a mess than to make a mess and then make another mess trying to clean up the first one.
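A minimal sketch of that cronjob change, assuming the producing job writes into a staging directory and Splunk monitors a separate directory (both paths and the helper name are examples, not from the thread):

```shell
#!/bin/sh
# Sketch: write files in a staging directory, then move each finished file
# into the directory Splunk actually monitors. Within one filesystem,
# mv is an atomic rename, so the forwarder never sees a half-written file.

finish_file() {
  src="$1"   # fully written file in the staging directory
  dest="$2"  # directory Splunk monitors
  mv "$src" "$dest"/
}

# Example usage once the job is done writing:
#   finish_file /data/staging/report.log /data/complete
```

Because the move happens only after the file is complete, Splunk indexes each file exactly once and the duplicate events never occur.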

Deleting duplicates automatically is entirely un-splunky. Running frequent | delete commands will severely mess up your buckets, and it doesn't reclaim disk space anyway. The ways that do reclaim space (age-based retention, size restrictions, cleaning an entire index) aren't selective enough for your case.


ishaanshekhar
Communicator

Thanks @martin_mueller!

I can't go for the .complete suggestion for a few reasons.

Your second suggestion is nice, but my searches can span long periods, sometimes a few months, and that makes them too slow to load.

How about deleting the duplicate events automatically? You mentioned it is not a good option, but is there any other option in my case?

Thanks
Ishaan
