Getting Data In

index entire file as a single event but avoid duplicate indexing

ishaanshekhar
Communicator

I need to monitor a folder where each file should be treated as a single event.

The files get their entire content over some time (usually hours).

Initially, several fragmentary events were created for the same file, because Splunk re-read the file each time it was modified over time.

To avoid that, I set the checksum method to "entire_md5" (CHECK_METHOD = entire_md5 in props.conf). This stops the fragmentary events by indexing the entire file as one single event. That is good; however, I now see duplicate events with the same content (the entire file).

Could you please help me figure out how to avoid these duplicate events? Maybe there is a way for Splunk to automatically delete the duplicates and retain only one event per file?

Thanks
Ishaan


martin_mueller
SplunkTrust
SplunkTrust

Move the files to a different place, or mark them with an extension like .complete, when you're done writing them. Have Splunk only monitor that different place or whitelist only that extension.
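A minimal sketch of what that whitelist could look like in inputs.conf — the monitored path, the .complete convention, and the sourcetype are placeholders, not values from this thread:

```
# inputs.conf -- only pick up files the writer has marked as finished.
# Path, extension, and sourcetype below are illustrative placeholders.
[monitor:///data/reports]
whitelist = \.complete$
sourcetype = my_reports
```

With this in place the forwarder ignores in-progress files entirely; renaming a file to .complete is what releases it for indexing, so each file is read exactly once, already complete.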

You could fix the duplicates at search time, something like | dedup source, but don't expect blazing performance.
Automatically deleting events is really bad mojo.
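For the search-time route, that dedup could look like the following — the index and sourcetype names are placeholders:

```
index=my_index sourcetype=my_reports
| dedup source
```

By default dedup keeps the first event it sees per source, which in reverse-time search order is the most recent one, so the last complete copy of each file survives. As noted above, this gets slow over long time ranges.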


martin_mueller
SplunkTrust
SplunkTrust

Modifying the cronjob/whatever that moves the files there is by far the best approach.
In general, it's better to avoid a mess than to make a mess and then make another mess trying to clean up the first one.

Deleting duplicates automatically is entirely un-splunky. Running frequent calls to the | delete command is going to severely mess up your buckets, and doesn't reclaim space. Ways to reclaim space (age retention, space restrictions, cleaning entire index) all aren't selective enough for your case.
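A hedged sketch of that writer-side change, with placeholder paths and file names: the idea is to finish writing the file somewhere Splunk does not monitor, then move it into the monitored directory in one atomic step.

```shell
# Sketch of a writer-side script (all paths are illustrative placeholders).
# Write the file in a staging directory that Splunk does not monitor,
# then move it into the monitored directory once it is complete.
STAGING=staging_demo
MONITORED=monitored_demo

mkdir -p "$STAGING" "$MONITORED"
printf 'full file content\n' > "$STAGING/report.txt"
# ... the slow writes happen here, possibly over hours ...

# A rename on the same filesystem is atomic, so Splunk only ever sees
# the finished file, and indexes it once as a single event:
mv "$STAGING/report.txt" "$MONITORED/report.txt.complete"
```

The .complete suffix also matches the whitelist approach from the accepted answer; either the separate directory or the extension alone is enough.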


ishaanshekhar
Communicator

Thanks @martin_mueller!

I can't go for the .complete suggestion for a few reasons.

Your second suggestion is nice, but our searches may span long periods, such as a few months, and the dedup makes them too slow to load.

How about deleting duplicate events automatically? You mentioned it is not a good option, but is there any other option in my case?

Thanks
Ishaan
