<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175091#M35164</link>
    <description>&lt;P&gt;Hi, while waiting for a better solution, let me tell you that you can remove the duplicates after indexing:&lt;BR /&gt;
1- Identify the duplicated events or file.&lt;BR /&gt;
2- Build a query that fetches what you want to remove and pipe it to delete.&lt;BR /&gt;
3- Schedule that search to run periodically.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;or&lt;BR /&gt;
you can use the &lt;CODE&gt;dedup&lt;/CODE&gt; command to filter out duplicate events at search time, without listing every field in the dedup command&lt;BR /&gt;
ex:&lt;/P&gt;

&lt;P&gt;index=your_index_name sourcetype=stblogs ... | dedup _raw&lt;/P&gt;</description>
    <pubDate>Mon, 28 Sep 2020 20:24:39 GMT</pubDate>
    <dc:creator>fdi01</dc:creator>
    <dc:date>2020-09-28T20:24:39Z</dc:date>
    <item>
      <title>Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175084#M35157</link>
      <description>&lt;P&gt;I'm setting up Splunk infrastructure, and one of the issues I'm facing is duplicate data. I've tested this in my test environment, which runs a Splunk universal forwarder that forwards data to an indexer.&lt;BR /&gt;
The universal forwarder is set to monitor just one directory. I confirmed the duplication as follows: I cleared all the indexes, copied the log file to the monitored location and noted the number of events that showed up in a search on the indexer. A few minutes later I copied the same file to the same location, and the number of events is now exactly twice the original.&lt;BR /&gt;
This is what my inputs.conf file looks like on the UF.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[splunk@splunk_universalforwarder local]$ cat inputs.conf 
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;My understanding is that Splunk, by default, does not index duplicate data. Why isn't that happening in my case? What could I do to fix this? Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jun 2015 04:48:52 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175084#M35157</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2015-06-25T04:48:52Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175085#M35158</link>
      <description>&lt;PRE&gt;&lt;CODE&gt;Add crcSalt=&amp;lt;SOURCE&amp;gt; in the inputs.conf configuration
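# A sketch only, assuming the monitor stanza from the question; the setting goes on the forwarder.
# Caveat: with this salt the full source path becomes part of the file check, so the
# same content under a different path is treated as a new file.
[monitor:///apps/webdata/splunkdata]
crcSalt = &amp;lt;SOURCE&amp;gt;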
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 25 Jun 2015 13:50:00 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175085#M35158</guid>
      <dc:creator>srinathd</dc:creator>
      <dc:date>2015-06-25T13:50:00Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175086#M35159</link>
      <description>&lt;P&gt;Just to confirm, I need to do this in the Splunk forwarder's inputs.conf, right?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jun 2015 19:33:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175086#M35159</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2015-06-25T19:33:51Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175087#M35160</link>
      <description>&lt;P&gt;Could someone suggest a possible fix for this? It's causing a delay in taking Splunk to our live/production environment.&lt;BR /&gt;
Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jun 2015 08:24:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175087#M35160</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2015-06-29T08:24:37Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175088#M35161</link>
      <description>&lt;P&gt;Hello shahzadarif, to do this, see this link:&lt;/P&gt;

&lt;P&gt;&lt;A href="http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html"&gt;http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html&lt;/A&gt; &lt;/P&gt;

&lt;P&gt;Or, while waiting for a better solution, let me tell you that you can also remove the duplicates after indexing:&lt;BR /&gt;
1- Identify the duplicated events or file.&lt;BR /&gt;
2- Build a query that fetches what you want to remove and pipe it to delete.&lt;BR /&gt;
3- Schedule that search to run periodically.&lt;/P&gt;
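
&lt;P&gt;For example, a minimal sketch of such a delete search, assuming the duplicate copy has its own source path (the path below is hypothetical; index and sourcetype are taken from the question). Run it without the final delete first to check what it matches. Note that delete needs the can_delete role, only makes the events unsearchable, and does not reclaim disk space or give back licence volume that was already counted at index time:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=main sourcetype=stblogs source="/apps/webdata/splunkdata/hostA/duplicate.log" | delete
&lt;/CODE&gt;&lt;/PRE&gt;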

&lt;HR /&gt;

&lt;P&gt;or&lt;BR /&gt;
you can use the &lt;CODE&gt;dedup&lt;/CODE&gt; command to filter out duplicate events at search time, without listing every field in the dedup command&lt;BR /&gt;
ex:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=your_index_name sourcetype=stblogs ... | dedup _raw
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 29 Jun 2015 09:20:42 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175088#M35161</guid>
      <dc:creator>fdi01</dc:creator>
      <dc:date>2015-06-29T09:20:42Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175089#M35162</link>
      <description>&lt;P&gt;Thanks for providing the link.&lt;BR /&gt;
I don't think it applies to my scenario. The log files we're processing in Splunk are parsed by a Python script on the Splunk forwarder nodes, so we don't have Windows or Samba in our environment.&lt;BR /&gt;
If it helps, this is what my corresponding props.conf file looks like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;bash-4.1$ cat ./etc/apps/search/local/props.conf
[stblogs]
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
TIME_FORMAT = %Y:%m:%d %H:%M:%S
TIME_PREFIX = date=[
description = LineBreak-Timestamp
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = date=
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 28 Sep 2020 20:24:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175089#M35162</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2020-09-28T20:24:11Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175090#M35163</link>
      <description>&lt;P&gt;If you could, modify $SPLUNK_HOME/etc/apps/splunk_deployment_monitor/default/macros.conf and change this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[forwarder_metrics]
definition = index="_internal" source="metrics.lo" group=tcpin_connections | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;To this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[forwarder_metrics]
definition = index="_internal" source="metrics.lo" group=tcpin_connections NOT eventType=* | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 28 Sep 2020 20:24:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175090#M35163</guid>
      <dc:creator>fdi01</dc:creator>
      <dc:date>2020-09-28T20:24:36Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175091#M35164</link>
      <description>&lt;P&gt;Hi, while waiting for a better solution, let me tell you that you can remove the duplicates after indexing:&lt;BR /&gt;
1- Identify the duplicated events or file.&lt;BR /&gt;
2- Build a query that fetches what you want to remove and pipe it to delete.&lt;BR /&gt;
3- Schedule that search to run periodically.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;or&lt;BR /&gt;
you can use the &lt;CODE&gt;dedup&lt;/CODE&gt; command to filter out duplicate events at search time, without listing every field in the dedup command&lt;BR /&gt;
ex:&lt;/P&gt;

&lt;P&gt;index=your_index_name sourcetype=stblogs ... | dedup _raw&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 20:24:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175091#M35164</guid>
      <dc:creator>fdi01</dc:creator>
      <dc:date>2020-09-28T20:24:39Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175092#M35165</link>
      <description>&lt;P&gt;I don't have a macros.conf file in the location you mentioned.&lt;BR /&gt;
I'm afraid the second solution won't work in my case. I might be getting hundreds of duplicate files a day, and in the few days I ran Splunk we exceeded our Splunk licence limit of 100GB by some margin every single day, all because of the duplicates Splunk had indexed.&lt;BR /&gt;
I don't understand why Splunk's default behaviour of not indexing duplicate data isn't working in my case. In my test cases nothing changed in the files that were indexed more than once: exactly the same data with exactly the same file name. Is there good documentation on how Splunk figures out what has already been indexed?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jul 2015 05:24:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175092#M35165</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2015-07-01T05:24:24Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Splunk is indexing duplicate data with my current universal forwarder configuration?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175093#M35166</link>
      <description>&lt;P&gt;I've resolved the issue.&lt;BR /&gt;
Our Python script was generating sub-directories under the monitored directory and all the files were getting saved under those sub-directories. Getting rid of those sub-directories has sorted the issue out.&lt;BR /&gt;
Thanks for all your help, everyone.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jul 2015 10:35:25 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Why-is-Splunk-is-indexing-duplicate-data-with-my-current/m-p/175093#M35166</guid>
      <dc:creator>shahzadarif</dc:creator>
      <dc:date>2015-07-08T10:35:25Z</dc:date>
    </item>
  </channel>
</rss>

