Getting Data In

How to prevent data duplication on a csv file ?

dm1
Contributor

I have a scheduled search that outputs the results every 5 minutes using the outputcsv command to local disk. The file is stored with name abc_dns.csv  

 

 

index=abc |fields _time _raw |fields - _indextime _sourcetype _subsecond |outputcsv abc_dns

 

 

Then I am forwarding that file to an external Indexer

inputs.conf

 

[monitor:///opt/splunk/var/run/splunk/csv/abc_dns.csv]
index = abc_dns_logs
sourcetype = abc_dns
#crcSalt = <SOURCE>

 

Below is the props.conf

 

[abc_dns]
INDEXED_EXTRACTIONS = csv
HEADER_FIELD_LINE_NUMBER = 1
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = structured
TRANSFORMS-t1 = eliminate_header

 

transforms.conf

 

[eliminate_header]
REGEX = ^"_time","_raw"$
DEST_KEY = queue
FORMAT = nullQueue

 

 

When I validate the results, I am seeing data is getting duplicated on the external Indexer.

I attempted to add crcSalt = <SOURCE> to check if it makes any difference, which seemed that it did initially, however, afterwhile, I saw data was getting duplicated again. In reality, there is indeed duplicate data in original logs itself, but overall I am actually seeing data from the monitored file is also getting duplicated.

Can anyone please help with this ?

Tags (1)
0 Karma

anilchaithu
Builder

@dm1 

If the csv file generated from the search has duplicate rows, indexer will index them as is. You need to remove duplicates in your search.

Please try this out.

index=abc |fields _time _raw |fields - _indextime _sourcetype _subsecond | dedup field_names* |outputcsv abc_dns

 

-- Hope this helps

0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

This challenge was first posted on Slack #puzzles channelFor BORE at .conf23, we had a puzzle question which ...

Splunk Community Badges!

  Hey everyone! Ready to earn some serious bragging rights in the community? Along with our existing badges ...

[Puzzles] Solve, Learn, Repeat: Matching cron expressions

This puzzle (first published here) is based on matching timestamps to cron expressions.All the timestamps ...