We have files that Splunk sees but does not index. We have 38 files FTP'ed to a folder that Splunk monitors every hour. Each hour, the previous 24 hours' worth of data is dumped, so that we do not lose data if the job fails to run as expected. Because of this overlap, we know that Splunk sees the old data as duplicate data, so we use this config to handle it:
sourcetype = meditech_npr
index = capsule_npr
crcSalt = <SOURCE>
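For context, these settings sit under a monitor stanza in inputs.conf. A rough sketch of the full stanza is below. The path is only a placeholder, not our real FTP drop folder:

```
# inputs.conf (sketch; the monitor path below is a hypothetical placeholder)
[monitor:///path/to/ftp/drop]
sourcetype = meditech_npr
index = capsule_npr
# Adds the full source path to the CRC that Splunk uses to decide
# whether it has already seen a file.
crcSalt = <SOURCE>
```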
This worked fine until the upgrade to Splunk 6/6.0.1, but it no longer appears to work. It is extremely important that this issue is resolved as soon as possible. Currently, I have to delete the entire _thefishbucket index every few hours to ensure that data gets indexed properly.