Getting Data In

File ingested generates another file with the same data which results in data duplication

ejmin
Path Finder

Hi Splunk experts, I have a problem that confuses me. For each file being ingested, another file appears with a different filename format but the same data, so the data is indexed twice. Please see the examples of the files below. (I don't want to use dedup, or a scheduled search that pipes to delete, because neither resolves the root cause.)

/opt/splunk/CompanyX/DAILY/JUL-20_GPW_DAILY_2020_AS_OF_07212020.txt (original)

/opt/splunk/CompanyX/DAILY/.JUL-20_GPW_DAILY_2020_AS_OF_07212020.txt.tokUbm (generated)

/opt/splunk/CompanyX/DAILY/JUL-19_GPW_DAILY_2020_AS_OF_07202020.txt (original)

/opt/splunk/CompanyX/DAILY/.JUL-19_GPW_DAILY_2020_AS_OF_07202020.txt.MjSIIF (generated)

/opt/splunk/CompanyX/DAILY/JUL-18_GPW_DAILY_2020_AS_OF_07192020.txt (original)

/opt/splunk/CompanyX/DAILY/.JUL-18_GPW_DAILY_2020_AS_OF_07192020.txt.nO9Y5C (generated)

 

The extraction happens at midnight and writes to a directory, which a script then replicates to the Splunk indexer instance.

My configuration in inputs.conf:

[monitor:///opt/splunk/CompanyX/DAILY/*]
disabled = false
index=gpw_daily
sourcetype=gpw_csv
crcSalt=<SOURCE>

 

The configuration has been working properly for the past year, and this problem only started this past week. If anyone has encountered it, please help me resolve it. Thanks

 


richgalloway
SplunkTrust

What changed in the past week?

Consider modifying the inputs.conf file to reduce duplication.  Either this

[monitor:///opt/splunk/CompanyX/DAILY/*.txt]
disabled = false
index=gpw_daily
sourcetype=gpw_csv
crcSalt=<SOURCE>

or this

[monitor:///opt/splunk/CompanyX/DAILY/*]
disabled = false
index=gpw_daily
sourcetype=gpw_csv
crcSalt=<SOURCE>
whitelist = \.txt$
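
Splunk treats the whitelist setting as a regular expression matched against the file path, not as a shell glob, so a pattern such as \.txt$ is one way to say "ends in .txt". The following is a minimal sketch in Python, not Splunk itself, that checks this assumed pattern against the six paths from the question. Python's re syntax agrees with PCRE for a pattern this simple; the names WHITELIST and passes_whitelist are illustrative only.

```python
import re

# Paths copied from the question: three originals and their dot-prefixed copies.
paths = [
    "/opt/splunk/CompanyX/DAILY/JUL-20_GPW_DAILY_2020_AS_OF_07212020.txt",
    "/opt/splunk/CompanyX/DAILY/.JUL-20_GPW_DAILY_2020_AS_OF_07212020.txt.tokUbm",
    "/opt/splunk/CompanyX/DAILY/JUL-19_GPW_DAILY_2020_AS_OF_07202020.txt",
    "/opt/splunk/CompanyX/DAILY/.JUL-19_GPW_DAILY_2020_AS_OF_07202020.txt.MjSIIF",
    "/opt/splunk/CompanyX/DAILY/JUL-18_GPW_DAILY_2020_AS_OF_07192020.txt",
    "/opt/splunk/CompanyX/DAILY/.JUL-18_GPW_DAILY_2020_AS_OF_07192020.txt.nO9Y5C",
]

# Assumed whitelist pattern: keep only paths whose name ends in ".txt".
WHITELIST = re.compile(r"\.txt$")


def passes_whitelist(path: str) -> bool:
    """Return True if the pattern is found anywhere in the path (unanchored search)."""
    return WHITELIST.search(path) is not None


kept = [p for p in paths if passes_whitelist(p)]
skipped = [p for p in paths if not passes_whitelist(p)]

for p in kept:
    print("keep:", p)
for p in skipped:
    print("skip:", p)
```

Under this pattern the three original files are kept and the three files with a random suffix are skipped. This only filters what gets indexed; it does not explain why the extra files started to appear.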

 

---
If this reply helps you, Karma would be appreciated.

ejmin
Path Finder

Hi @richgalloway ,

Thanks for the reply. The duplication started this past week: files named .txt.(random letters) began to appear, although everything had been fine for the past year. I wonder how it happens, because when I searched _internal and checked _indextime, the two ingestions happen within a second of each other and, according to the logs, with different byte counts.

Anyway, if I don't find the root cause, I'll consider using your whitelist setting. Thanks. I'll treat it as a temporary workaround rather than a root-cause fix.
