Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Thu, 25 Jun 2015 04:48:52 GMT

I'm setting up a Splunk infrastructure, and one of the issues I'm facing is duplicate data. I've reproduced it in my test environment, where a Splunk universal forwarder sends data to an indexer.
The universal forwarder is set to monitor just one directory. I confirmed the duplication by clearing all the indexes, copying a log file to the monitored location, and noting the number of events that showed up in a search on the indexer. A few minutes later I copied the same file to the same location, and the number of events was exactly double the original count.
This is what my inputs.conf file looks like on the UF:

[splunk@splunk_universalforwarder local]$ cat inputs.conf 
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs

My understanding is that Splunk, by default, does not index duplicate data. Why isn't that happening in my case, and what can I do to fix it? Thanks

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

srinathd — Thu, 25 Jun 2015 13:50:00 GMT

Add crcSalt = <SOURCE> to the monitor stanza in inputs.conf.
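
As a sketch only, assuming the monitor stanza from the question, the setting would sit in the forwarder's inputs.conf roughly as shown here. By default the monitor input compares a CRC of the first 256 bytes of each file to decide whether it has already seen that file. With crcSalt = <SOURCE> the full source path is added to that CRC, so the same content under a different path is treated as a new file. Depending on how the files arrive, this may increase rather than reduce re-indexing, so it is worth testing before relying on it.

```
# inputs.conf on the universal forwarder (stanza copied from the question)
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs
# Adds the full source path to the CRC used to recognise already-seen files
crcSalt = <SOURCE>
```

The forwarder needs a restart to pick up the change.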

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Thu, 25 Jun 2015 19:33:51 GMT

Just to confirm, I need to do this in the Splunk forwarder's inputs.conf, right?

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Mon, 29 Jun 2015 08:24:37 GMT

Could someone suggest a possible fix for this? It's causing a delay in taking Splunk into our live/production environment.
Thanks

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

fdi01 — Mon, 29 Jun 2015 09:20:42 GMT

Hello shahzadarif, to fix this, see this link:

http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html

Or, while waiting for a better solution, let me tell you that you can also clean this up after indexing:
1- Identify the duplicated events or files.
2- Build a query that fetches what you want to remove and pipe it to delete.
3- You can schedule that search to run periodically.

or
you can use the dedup command to filter out duplicate events at search time. Running dedup on _raw avoids having to list every field in the dedup command.
ex:

index=your_index_name sourcetype=stblogs ... | dedup _raw

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Mon, 28 Sep 2020 20:24:11 GMT

Thanks for providing the link.
I don't think it applies to my scenario. The log files we process in Splunk are parsed by a Python script on the Splunk forwarder nodes, so we don't have Windows or Samba in our environment.
If it helps, this is what my corresponding props.conf file looks like:

bash-4.1$ cat ./etc/apps/search/local/props.conf
[stblogs]
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
TIME_FORMAT = %Y:%m:%d %H:%M:%S
TIME_PREFIX = date=[
description = LineBreak-Timestamp
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = date=

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

fdi01 — Mon, 28 Sep 2020 20:24:36 GMT

If you could, modify $SPLUNK_HOME/etc/apps/splunk_deployment_monitor/default/macros.conf and change this:

[forwarder_metrics]
definition = index="_internal" source="*metrics.lo*" group=tcpin_connections | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid

To this:
[forwarder_metrics]
definition = index="_internal" source="*metrics.lo*" group=tcpin_connections NOT eventType=* | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

fdi01 — Mon, 28 Sep 2020 20:24:39 GMT

Hi, while waiting for a better solution, let me tell you that you can clean this up after indexing:
1- Identify the duplicated events or files.
2- Build a query that fetches what you want to remove and pipe it to delete.
3- You can schedule that search to run periodically.

or
you can use the dedup command to filter out duplicate events at search time. Running dedup on _raw avoids having to list every field in the dedup command.
ex:

index=your_index_name sourcetype=stblogs ... | dedup _raw

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Wed, 01 Jul 2015 05:24:24 GMT

I don't have a macros.conf file in the location you mentioned.
I'm afraid the second solution won't work in my case. I might be getting hundreds of duplicate files a day, and in the few days I ran Splunk we exceeded our Splunk licence limit of 100GB every single day by some margin, all because of the duplicates Splunk had indexed.
I don't understand why Splunk's default behaviour of not indexing duplicate data isn't working in my case. In my tests nothing changed in the files that were indexed more than once: exactly the same data with exactly the same file name. Is there good documentation on how Splunk figures out what has already been indexed?

Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?

shahzadarif — Wed, 08 Jul 2015 10:35:25 GMT

I've resolved the issue.
Our Python script was generating sub-directories under the monitored directory and all the files were getting saved under those sub-directories. Getting rid of those sub-directories has sorted the issue out.
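
For anyone who cannot change how their script writes files, here is a sketch of one setting that may be worth testing, assuming the monitor stanza from the original question. The recursive setting in inputs.conf defaults to true; setting it to false tells the monitor input not to descend into sub-directories of the monitored path. This is an untested assumption about what might help, not the fix described above. It only makes sense if the files to be indexed sit directly in the monitored directory, because anything that exists only in a sub-directory would then not be indexed at all.

```
# inputs.conf on the universal forwarder (stanza copied from the question)
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs
# Do not descend into sub-directories of the monitored path (default is true)
recursive = false
```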
Thanks for all your help everyone.