Getting Data In

Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I'm setting up Splunk infrastructure, and one of the issues I'm facing is duplicate data. I've reproduced this in my test environment, which runs a Splunk universal forwarder that forwards data to an indexer.
The universal forwarder is set to monitor just one directory. I confirmed the duplication by clearing all the indexes, copying a log file into the monitored location, and noting the number of events that showed up in a search on the indexer. A few minutes later I copied the same file into the same location, and the event count was exactly double the original.
This is what my inputs.conf file looks like on the UF:

[splunk@splunk_universalforwarder local]$ cat inputs.conf 
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs

My understanding is that Splunk, by default, does not index duplicate data. Why isn't that happening in my case? What can I do to fix this? Thanks


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Contributor
Add crcSalt = &lt;SOURCE&gt; to the inputs.conf configuration.
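
For reference, a minimal sketch of what that would look like, using the stanza from the original post. The literal string &lt;SOURCE&gt; tells Splunk to add the full source path to the CRC it computes to recognise already-seen files, so identical content arriving under a different path is treated as a new file:

[monitor:///apps/webdata/splunkdata]
crcSalt = <SOURCE>
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs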

Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Just to confirm, I need to do this in the forwarder's inputs.conf, right?


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Could someone suggest a possible fix for this? It's delaying our move of Splunk into the live/production environment.
Thanks


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Motivator

Hello shahzadarif, to do this see this link:

http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html

Or, while waiting for a better solution, you can also handle it after indexing:
1. Identify the duplicated events or files.
2. Build a query that fetches what you want to remove and pipe it to delete (see the sketch below).
3. Schedule that search to run periodically.
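
For example, a hypothetical clean-up search built from the index and sourcetype in the original post (the source value is a placeholder for whatever identifies the duplicated events; note that delete requires a role with the can_delete capability, and it only hides events from search results, it does not reduce licence usage):

index=main sourcetype=stblogs source="/apps/webdata/splunkdata/duplicated_file.log" | delete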


Or you can use the dedup command to filter out duplicate events at search time, without adding every field to the dedup command, e.g.:

index=your_index_name sourcetype=stblogs ... | dedup _raw

Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

Thanks for providing the link.
I don't think it applies to my scenario. The log files we're processing in Splunk are produced by a Python script on the Splunk forwarder nodes, so we don't have Windows or Samba in our environment.
If it helps, this is what my corresponding props.conf file looks like:

bash-4.1$ cat ./etc/apps/search/local/props.conf
[stblogs]
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
TIME_FORMAT = %Y:%m:%d %H:%M:%S
TIME_PREFIX = date=[
description = LineBreak-Timestamp
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = date=


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Motivator

If you could, modify $SPLUNK_HOME/etc/apps/splunk_deployment_monitor/default/macros.conf and change this:

[forwarder_metrics]
definition = index="_internal" source="*metrics.log*" group=tcpin_connections | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid

To this:
[forwarder_metrics]
definition = index="_internal" source="*metrics.log*" group=tcpin_connections NOT eventType=* | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I don't have a macros.conf file in the location you've mentioned.
I'm afraid the second solution won't work in my case. I might be getting hundreds of duplicate files a day, and during the few days I ran Splunk, we exceeded our Splunk licence limit of 100GB by some margin every single day, due to all the duplicates Splunk had indexed.
I don't understand why Splunk's default behaviour of not indexing duplicate data isn't working in my case. In my test cases nothing changed in the files that were indexed more than once: exact same data with the exact same file name. Is there good documentation on how Splunk figures out what has already been indexed?


Re: Why is Splunk indexing duplicate data with my current universal forwarder configuration?

Path Finder

I've resolved the issue.
Our Python script was creating sub-directories under the monitored directory, and all the files were being saved under those sub-directories. Getting rid of the sub-directories sorted the issue out.
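
For reference, had the sub-directories needed to stay, I believe one alternative sketch would have been to stop the monitor input from descending into them via the recursive setting (it defaults to true), along these lines:

[monitor:///apps/webdata/splunkdata]
recursive = false
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs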
Thanks for all your help, everyone.