Experiencing duplicate indexing: how do I solve this problem?

lostcauz3
Path Finder

I have a directory that is being monitored on a Splunk heavy forwarder:

/app_monitoring

This directory receives a file called Report.csv every day.

The file may contain data that is already indexed. How can I prevent duplicate indexing in this case?

Do I have to change anything in the inputs.conf in the app folder? Please advise.


PickleRick
SplunkTrust

There is no built-in deduplication of events at ingestion. If you receive or read the same event twice, it will be ingested and indexed twice. It's as simple as that.

That said, the file monitoring input does remember a checksum of each file, along with how far it has already read it, so it does not re-read every monitored file after each restart. The checksum is calculated from the beginning of the file, so it stays the same even after data is appended to the file. And even if the file is renamed (for example, by logrotate), the checksum calculated from the beginning of the file stays the same, so the file will not be read again. Setting crcSalt = <SOURCE> adds the file's path to the checksum calculation, so two files with different names but identical beginnings would both get indexed.
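To make the checksum behavior above a bit more concrete, here is a rough Python model of it. This is only a sketch and not Splunk's actual implementation: `zlib.crc32` stands in for whatever checksum Splunk really uses, `HEAD_BYTES = 256` assumes the default head length, and `head_checksum` and `Tracker` are hypothetical names made up for the illustration.

```python
import zlib

# Assumption: the checksum covers only the first 256 bytes of the file.
HEAD_BYTES = 256


def head_checksum(data: bytes, source: str = "", salt_with_source: bool = False) -> int:
    """CRC over the head of a file, optionally salted with the file's path."""
    head = data[:HEAD_BYTES]
    if salt_with_source:
        # Rough analogue of crcSalt = <SOURCE>: the path becomes part of the input.
        head = source.encode() + head
    return zlib.crc32(head)


class Tracker:
    """Hypothetical, heavily simplified record of checksum -> bytes already read."""

    def __init__(self, salt_with_source: bool = False):
        self.salt_with_source = salt_with_source
        self.seen = {}

    def read_new(self, source: str, data: bytes) -> bytes:
        """Return only the part of the file that has not been read before."""
        key = head_checksum(data, source, self.salt_with_source)
        offset = self.seen.get(key, 0)
        self.seen[key] = len(data)
        return data[offset:]


if __name__ == "__main__":
    report = b"a" * 300 + b"row1\n"  # head longer than HEAD_BYTES
    grown = report + b"row2\n"

    plain = Tracker()
    print(plain.read_new("/app_monitoring/Report.csv", report))   # whole file
    print(plain.read_new("/app_monitoring/Report.csv", grown))    # appended part only
    print(plain.read_new("/app_monitoring/Report.csv.1", grown))  # renamed: nothing new

    salted = Tracker(salt_with_source=True)
    print(len(salted.read_new("/app_monitoring/Report.csv", grown)))
    print(len(salted.read_new("/app_monitoring/Report.csv.1", grown)))  # read again in full
```

In this model the unsalted tracker treats the renamed copy as already read, while the salted tracker reads it in full a second time, which is why salting with the source tends to create duplicates rather than prevent them. Neither variant looks at individual events, so rows repeated inside a genuinely new file are still indexed again.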

gcusello
SplunkTrust

Hi @lostcauz3,

If you're talking about duplicate files, Splunk doesn't index a file containing the same data twice, even under a different filename. What Splunk cannot do is detect that some individual events are already indexed.

So you cannot discard duplicates before indexing; you can only dedup results at search time.

Ciao.

Giuseppe

lostcauz3
Path Finder

If I add crcSalt = <SOURCE> to the inputs.conf file, what will this do in my case?

I'm very confused about this.

gcusello
SplunkTrust

Hi @lostcauz3,

No, crcSalt = <SOURCE> allows an already indexed file to be indexed again, so it won't help you avoid duplicates.

It isn't possible to filter out logs that are already indexed.

Splunk only avoids re-indexing an entire file that was already indexed, not a part of it.

Ciao.

Giuseppe
