<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Experiencing duplicate indexing- How do I solve this problem? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585772#M103037</link>
    <description>&lt;P&gt;There is no built-in deduplication of events on ingestion. If Splunk receives or reads the same event twice, it will be ingested and indexed twice. As simple as that.&lt;/P&gt;&lt;P&gt;Having said that, the file monitoring input does remember a checksum of each file along with how far it has already read it, so it will not re-read every monitored file after a restart. The checksum is calculated from the beginning of the file, so it stays the same even after data is appended to the file. And even if the file is renamed (for example, by logrotate), the checksum calculated from the beginning of the file stays the same, so the file will not be read again. Adding crcSalt = &amp;lt;SOURCE&amp;gt; mixes the filename into the calculated checksum, so two files with different names but the same leading bytes would both get indexed.&lt;/P&gt;</description>
    <pubDate>Fri, 18 Feb 2022 19:30:43 GMT</pubDate>
    <dc:creator>PickleRick</dc:creator>
    <dc:date>2022-02-18T19:30:43Z</dc:date>
    <item>
      <title>Experiencing duplicate indexing- How do I solve this problem?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585738#M103030</link>
      <description>&lt;P class="lia-align-left"&gt;I have a directory that is being monitored on a Splunk heavy forwarder:&lt;BR /&gt;&lt;BR /&gt;/app_monitoring&lt;BR /&gt;&lt;BR /&gt;The above directory receives a file every day called Report.csv.&lt;BR /&gt;&lt;BR /&gt;It may contain duplicate data that is already indexed. How do I prevent duplicate indexing in this case?&lt;BR /&gt;&lt;BR /&gt;Do I have to change anything in the inputs.conf in the app folder? Please advise.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Feb 2022 17:05:04 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585738#M103030</guid>
      <dc:creator>lostcauz3</dc:creator>
      <dc:date>2022-02-18T17:05:04Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate indexing</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585742#M103031</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/240555"&gt;@lostcauz3&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;if you're speaking of duplicate files, Splunk doesn't index a file containing the same data twice, even with a different filename; however, Splunk cannot identify that some events are already present.&lt;/P&gt;&lt;P&gt;So you cannot discard duplicates before indexing, you can only dedup results in search.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Fri, 18 Feb 2022 17:07:04 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585742#M103031</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2022-02-18T17:07:04Z</dc:date>
    </item>
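    <!--
      The search-time dedup mentioned in the reply above could look like the following SPL sketch; the index and source values here are hypothetical examples, not taken from the thread:

      index=main source="/app_monitoring/Report.csv"
      | dedup _raw

      dedup _raw keeps one copy of each distinct raw event, which works when the duplicate events are byte-identical; if timestamps or other fields differ between copies, dedup on a chosen set of fields instead.
    -->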
    <item>
      <title>Re: Duplicate indexing</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585746#M103033</link>
      <description>&lt;P&gt;If I add crcSalt = &amp;lt;SOURCE&amp;gt; to the inputs.conf file, what will this do in my case?&lt;/P&gt;&lt;P&gt;I'm very confused about this.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Feb 2022 17:12:47 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585746#M103033</guid>
      <dc:creator>lostcauz3</dc:creator>
      <dc:date>2022-02-18T17:12:47Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate indexing</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585747#M103034</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/240555"&gt;@lostcauz3&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;no, crcSalt = &amp;lt;SOURCE&amp;gt; permits Splunk to index an already indexed file again.&lt;/P&gt;&lt;P&gt;It isn't possible to filter out logs that are already indexed.&lt;/P&gt;&lt;P&gt;Splunk doesn't index an entire already indexed file twice, but it cannot skip just a part of one.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Fri, 18 Feb 2022 17:16:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585747#M103034</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2022-02-18T17:16:05Z</dc:date>
    </item>
    <item>
      <title>Re: Experiencing duplicate indexing- How do I solve this problem?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585772#M103037</link>
      <description>&lt;P&gt;There is no built-in deduplication of events on ingestion. If Splunk receives or reads the same event twice, it will be ingested and indexed twice. As simple as that.&lt;/P&gt;&lt;P&gt;Having said that, the file monitoring input does remember a checksum of each file along with how far it has already read it, so it will not re-read every monitored file after a restart. The checksum is calculated from the beginning of the file, so it stays the same even after data is appended to the file. And even if the file is renamed (for example, by logrotate), the checksum calculated from the beginning of the file stays the same, so the file will not be read again. Adding crcSalt = &amp;lt;SOURCE&amp;gt; mixes the filename into the calculated checksum, so two files with different names but the same leading bytes would both get indexed.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Feb 2022 19:30:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Experiencing-duplicate-indexing-How-do-I-solve-this-problem/m-p/585772#M103037</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2022-02-18T19:30:43Z</dc:date>
    </item>
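    <!--
      A minimal inputs.conf sketch of the monitor stanza discussed in this thread; the path matches the question, while the sourcetype and index names are hypothetical:

      [monitor:///app_monitoring]
      sourcetype = report_csv
      index = main
      # crcSalt = <SOURCE> mixes the full source path into the initial CRC,
      # so a renamed copy of an already indexed file (same leading bytes,
      # different name) is read again. It does NOT deduplicate; if anything,
      # it makes re-reading more likely. The literal text <SOURCE> is the
      # documented special value, not a placeholder to substitute.
      crcSalt = <SOURCE>
    -->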
  </channel>
</rss>

