Getting Data In

Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I'm setting up Splunk infrastructure, and one of the issues I'm facing is duplicate data. I've reproduced this in my test environment, which runs a Splunk universal forwarder that forwards data to an indexer.
The universal forwarder is set to monitor just one directory. I confirmed the duplication by clearing all the indexes, copying a log file into the monitored location, and noting the number of events that showed up in a search on the indexer. A few minutes later I copied the same file into the same location, and the event count was exactly double the original.
This is what my inputs.conf file looks like on the UF:

[splunk@splunk_universalforwarder local]$ cat inputs.conf 
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs

My understanding is that Splunk, by default, does not index duplicate data. Why isn't that happening in my case? What can I do to fix this? Thanks


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Contributor
Add crcSalt = &lt;SOURCE&gt; to the inputs.conf configuration.
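
For reference, a minimal sketch of what that would look like, using the stanza from the original post. The literal string &lt;SOURCE&gt; tells Splunk to add the full source path to the CRC it computes to recognise already-seen files, so identical content arriving under a different path is treated as a new file:

[monitor:///apps/webdata/splunkdata]
crcSalt = <SOURCE>
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs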

Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Just to confirm, I need to do this in the forwarder's inputs.conf, right?


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Could someone suggest a possible fix for this? It's delaying our move of Splunk into the live/production environment.
Thanks


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Motivator

Hello shahzadarif, to do this see this link:

http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html

Or, while waiting for a better solution, you can also handle it after indexing:
1. Identify the duplicated events or files.
2. Build a query that fetches what you want to remove and pipe it to delete (see the sketch below).
3. Schedule that search to run periodically.
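
For example, a hypothetical clean-up search built from the index and sourcetype in the original post (the source value is a placeholder for whatever identifies the duplicated events; note that delete requires a role with the can_delete capability, and it only hides events from search results, it does not reduce licence usage):

index=main sourcetype=stblogs source="/apps/webdata/splunkdata/duplicated_file.log" | delete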


Or you can use the dedup command to filter out duplicate events at search time, without adding every field to the dedup command, e.g.:

index=your_index_name sourcetype=stblogs ... | dedup _raw

Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Thanks for providing the link.
I don't think it applies to my scenario. The log files we're processing in Splunk are produced by a Python script on the Splunk forwarder nodes, so we don't have Windows or Samba in our environment.
If it helps, this is what my corresponding props.conf file looks like:

bash-4.1$ cat ./etc/apps/search/local/props.conf
[stblogs]
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
TIME_FORMAT = %Y:%m:%d %H:%M:%S
TIME_PREFIX = date=[
description = LineBreak-Timestamp
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = date=


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Motivator

If you could, modify $SPLUNK_HOME/etc/apps/splunk_deployment_monitor/default/macros.conf and change this:

[forwarder_metrics]
definition = index="_internal" source="*metrics.log*" group=tcpin_connections | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid

To this:
[forwarder_metrics]
definition = index="_internal" source="*metrics.log*" group=tcpin_connections NOT eventType=* | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I don't have a macros.conf file in the location you've mentioned.
I'm afraid the second solution won't work in my case. I might be getting hundreds of duplicate files a day, and during the few days I ran Splunk, we exceeded our Splunk licence limit of 100GB by some margin every single day, due to all the duplicates Splunk had indexed.
I don't understand why Splunk's default behaviour of not indexing duplicate data isn't working in my case. In my test cases nothing changed in the files that were indexed more than once: exact same data with the exact same file name. Is there good documentation on how Splunk figures out what has already been indexed?


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I've resolved the issue.
Our Python script was creating sub-directories under the monitored directory, and all the files were being saved under those sub-directories. Getting rid of the sub-directories sorted the issue out.
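
For reference, had the sub-directories needed to stay, I believe one alternative sketch would have been to stop the monitor input from descending into them via the recursive setting (it defaults to true), along these lines:

[monitor:///apps/webdata/splunkdata]
recursive = false
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs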
Thanks for all your help, everyone.