Getting Data In

index entire file as a single event but avoid duplicate indexing

ishaanshekhar
Communicator

I need to monitor a folder where each file should be treated as a single event.

The files get their entire content over some time (usually hours).

Initially, several fragmentary events were created for the same file, because Splunk re-read the file each time it was modified over time.

To avoid that, I set the checksum method to "entire_md5" (CHECK_METHOD = entire_md5 in props.conf). This stops the fragmentary events by indexing the entire file as one single event. That is good; however, I now see duplicate events with the same content (the entire file).

Could you please help me figure out how to avoid these duplicate events? Maybe there is a way for Splunk to automatically delete the duplicates and retain only one event per file?

Thanks
Ishaan


martin_mueller
SplunkTrust
SplunkTrust

Move the files to a different place, or mark them with an extension like .complete, when you're done writing them. Have Splunk only monitor that different place or whitelist only that extension.
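A minimal sketch of what that whitelist could look like in inputs.conf — the monitored path, the .complete convention, and the sourcetype are placeholders, not values from this thread:

```
# inputs.conf -- only pick up files the writer has marked as finished.
# Path, extension, and sourcetype below are illustrative placeholders.
[monitor:///data/reports]
whitelist = \.complete$
sourcetype = my_reports
```

With this in place the forwarder ignores in-progress files entirely; renaming a file to .complete is what releases it for indexing, so each file is read exactly once, already complete.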

You could fix the duplicates at search time, something like | dedup source, but don't expect blazing performance.
Automatically deleting events is really bad mojo.
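For the search-time route, that dedup could look like the following — the index and sourcetype names are placeholders:

```
index=my_index sourcetype=my_reports
| dedup source
```

By default dedup keeps the first event it sees per source, which in reverse-time search order is the most recent one, so the last complete copy of each file survives. As noted above, this gets slow over long time ranges.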


martin_mueller
SplunkTrust
SplunkTrust

Modifying the cronjob/whatever that moves the files there is by far the best approach.
In general, it's better to avoid a mess than to make a mess and then make another mess trying to clean up the first one.

Deleting duplicates automatically is entirely un-splunky. Running frequent calls to the | delete command is going to severely mess up your buckets, and doesn't reclaim space. Ways to reclaim space (age retention, space restrictions, cleaning entire index) all aren't selective enough for your case.
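A hedged sketch of that writer-side change, with placeholder paths and file names: the idea is to finish writing the file somewhere Splunk does not monitor, then move it into the monitored directory in one atomic step.

```shell
# Sketch of a writer-side script (all paths are illustrative placeholders).
# Write the file in a staging directory that Splunk does not monitor,
# then move it into the monitored directory once it is complete.
STAGING=staging_demo
MONITORED=monitored_demo

mkdir -p "$STAGING" "$MONITORED"
printf 'full file content\n' > "$STAGING/report.txt"
# ... the slow writes happen here, possibly over hours ...

# A rename on the same filesystem is atomic, so Splunk only ever sees
# the finished file, and indexes it once as a single event:
mv "$STAGING/report.txt" "$MONITORED/report.txt.complete"
```

The .complete suffix also matches the whitelist approach from the accepted answer; either the separate directory or the extension alone is enough.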


ishaanshekhar
Communicator

Thanks @martin_mueller!

I can't go for the .complete suggestion for a few reasons.

Your second suggestion is nice, but our searches may span long periods, such as a few months, and the dedup makes them too slow to load.

How about deleting duplicate events automatically? You mentioned it is not a good option, but is there any other option in my case?

Thanks
Ishaan
