Getting Data In

How to troubleshoot why a universal forwarder is forwarding duplicate events for monitored CSV files?

dhavamanis
Builder

We are processing CSV files for indexing in Splunk, but the universal forwarder always forwards each file twice. Can you please guide us on how to avoid this duplicate indexing?

If we keep a low number of files in the directory, we don't see the duplicate indexing; when the number of files is large, we see duplicate entries. We rsync the files from Google Cloud Storage, and that seems to be causing the issue. We currently have more than 90,000 CSV files in a single directory. Can you please suggest how to handle this case?

Forwarder config:

inputs.conf

[monitor:///opt/apps/appdata/apps/test/]
index=mobileapps
sourcetype=mobilegpcsv
crcSalt = 
whitelist = \.csv$
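
Note that the `crcSalt` value above appears to have been stripped, likely by the forum's HTML rendering eating the angle brackets. Assuming the intent was the documented `<SOURCE>` token, the stanza would look like this sketch:

```
[monitor:///opt/apps/appdata/apps/test/]
index = mobileapps
sourcetype = mobilegpcsv
# <SOURCE> is the literal documented token, not a placeholder: it mixes
# the full file path into the initial CRC, so files that share the same
# leading bytes are still tracked as distinct files.
crcSalt = <SOURCE>
whitelist = \.csv$
```

One caveat of `crcSalt = <SOURCE>`: because the path becomes part of the checksum, a renamed or rotated file is treated as brand new and re-indexed in full, so it suits directories where file names are stable.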

props.conf

[mobilegpcsv]
CHARSET=UCS-2-INTERNAL
INDEXED_EXTRACTIONS=csv
KV_MODE=none
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=false
TIMESTAMP_FIELDS=Date
TIME_FORMAT=%Y-%m-%d
disabled=false
pulldown_type=true

Also noticed this in the splunkd.log file,

04-11-2016 12:39:00.266 -0400 WARN  UTF8Processor - Using charset UTF-8, as the monitor is believed over the raw text which may be UCS-2-INTERNAL - data_source="/opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv", data_host="node.abc.xyz.com", data_sourcetype="mobilegpcsv"
04-11-2016 12:39:03.270 -0400 INFO  WatchedFile - Will begin reading at offset=0 for file='/opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv'.
04-11-2016 12:39:03.272 -0400 WARN  UTF8Processor - Using charset UTF-8, as the monitor is believed over the raw text which may be UCS-2-INTERNAL - data_source="/opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv", data_host="node.abc.xyz.com", data_sourcetype="mobilegpcsv"

Adding DEBUG logs:

04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile - setting trailing nulls to false via 'true' or 'false' from conf'
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Loading state from fishbucket.
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Attempting to load indexed extractions config from conf=source::/opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv|host::node.abc.xyz.com|mobilegpcsv|6 ...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Loaded indexed extractions settings: mode=2 HEADER_FIELD_LINE_NUMBER=0 HEADER_FIELD_DELIMITER=',' HEADER_FIELD_QUOTE='"' FIELD_DELIMITER=',' FIELD_QUOTE='"'
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   CSV initCrc: skip_bytes=127 at have_read=256. Note that skip_bytes might be different to the actual number of bytes skipped in the file because of utf-8 conversion during utf8Converter parsing.
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   CSV initCrc: checksum_bytes=61 after consumed=67 at have_read=512.
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.354 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Reading for CSV initCrc...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   initcrc has changed to: 0x58eed70075603e5f.
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - Record found, will advance file by offset=4552 initcrc=0x58eed70075603e5f.
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Attempting to load indexed extractions config from conf=source::/opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv|host::node.abc.xyz.com|mobilegpcsv|300 ...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Loaded indexed extractions settings: mode=2 HEADER_FIELD_LINE_NUMBER=0 HEADER_FIELD_DELIMITER=',' HEADER_FIELD_QUOTE='"' FIELD_DELIMITER=',' FIELD_QUOTE='"'
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - min_batch_size_bytes set to 20971520
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - seeking /opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv to off=4552
04-11-2016 16:32:49.355 -0400 INFO  WatchedFile - Resetting fd to re-extract header.
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Saving off=4552 before processing header...
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - seeking /opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv to off=0
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Loaded structured data settings: configured=1 mode=2 HEADER_FIELD_LINE_NUMBER=0 HEADER_FIELD_DELIMITER=',' HEADER_FIELD_QUOTE='"' FIELD_DELIMITER=',' FIELD_QUOTE='"'.
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile -   Restoring off=4552 after processing header.
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - seeking /opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv to off=4552
04-11-2016 16:32:49.355 -0400 DEBUG WatchedFile - Reached EOF: /opt/apps/appdata/apps/test/installs_com.test.aaaa_201206_overview.csv (read 0 bytes)
04-11-2016 16:32:49.356 -0400 DEBUG WatchedFile - setting trailing nulls to false via 'true' or 'false' from conf'
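
The forwarder identifies a monitored file by a checksum over its first bytes (the initCrc in the logs above; with INDEXED_EXTRACTIONS=csv it is computed CSV-aware, skipping past the header). A minimal illustration of the idea, using `zlib.crc32` as a stand-in for the forwarder's actual hash, shows why files that share identical leading bytes collide without a salt, and why salting with the path (what `crcSalt = <SOURCE>` does) keeps them distinct:

```python
import zlib

def init_crc(path, data, length=256, salt_source=False):
    """Illustrative stand-in for the forwarder's initial checksum:
    a CRC over the first `length` bytes of the file, optionally
    salted with the file path."""
    prefix = data[:length]
    if salt_source:
        prefix = path.encode() + prefix
    return zlib.crc32(prefix)

# Two report files that share an identical header block longer
# than the checksummed prefix.
header = b"Date,Package,Installs\n" + b"#" * 300
a = init_crc("/data/app_a.csv", header + b"2012-06-01,com.a,10\n")
b = init_crc("/data/app_b.csv", header + b"2012-06-01,com.b,99\n")
print(a == b)  # True: same first 256 bytes -> the files look identical

a_salted = init_crc("/data/app_a.csv", header + b"2012-06-01,com.a,10\n",
                    salt_source=True)
b_salted = init_crc("/data/app_b.csv", header + b"2012-06-01,com.b,99\n",
                    salt_source=True)
print(a_salted == b_salted)  # False: the path disambiguates them
```

The flip side, visible in the `initcrc has changed` line above, is that anything which rewrites those leading bytes (such as rsync re-transferring a file) changes the checksum, and the forwarder then treats the file as new and re-reads it.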

yannK
Splunk Employee

It looks like your CSV was modified and its first lines changed.
Hence the "initcrc" changed and the file was detected as new:

initcrc has changed to: 0x58eed70075603e5f.


dhavamanis
Builder

Thanks. If we keep a low number of files in the directory, we don't see the duplicate indexing; when the number of files is large, we see duplicate entries. We rsync the files from Google Cloud Storage, and that seems to be causing the issue. We currently have more than 90,000 CSV files in a single directory. Can you please suggest how to handle this case?
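
For a directory of this size, one commonly used inputs.conf setting is `ignoreOlderThan`, which stops the tailing processor from tracking files whose modification time is past a cutoff. A sketch, assuming files older than a week are never updated again:

```
[monitor:///opt/apps/appdata/apps/test/]
index = mobileapps
sourcetype = mobilegpcsv
whitelist = \.csv$
# Skip files last modified more than 7 days ago, so the tailing
# processor does not have to track all 90,000 CSVs on every scan.
ignoreOlderThan = 7d
```

Be aware of the documented caveat: once a file has been skipped by `ignoreOlderThan`, it is never read again even if it is later updated, so this only fits directories where old files are immutable.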
