<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to avoid indexing duplicates? (Getting Data In)</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</link>
    <description>&lt;P&gt;We are trying to ingest a large volume (petabytes) of data into Splunk.&lt;/P&gt;&lt;P&gt;The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.&lt;/P&gt;&lt;P&gt;The pipeline is:&lt;/P&gt;&lt;P&gt;JSON files &amp;gt; Folder &amp;gt; UF &amp;gt; HF Cluster &amp;gt; Indexer Cluster&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;~ UF - inputs.conf&lt;/P&gt;&lt;P class=""&gt;[batch:///folder]&lt;/P&gt;&lt;P class=""&gt;_TCP_ROUTING = p2s_au_hf&lt;/P&gt;&lt;P class=""&gt;crcSalt = &amp;lt;SOURCE&amp;gt;&lt;/P&gt;&lt;P class=""&gt;disabled = false&lt;/P&gt;&lt;P class=""&gt;move_policy = sinkhole&lt;/P&gt;&lt;P class=""&gt;recursive = false&lt;/P&gt;&lt;P class=""&gt;whitelist = \.json$&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;We are seeing that events from specific files (NOT all of them) are getting duplicated: some files are indexed exactly twice.&lt;/P&gt;&lt;P class=""&gt;Since this is a [batch://] input, which is supposed to delete each file after reading it, and crcSalt = &amp;lt;SOURCE&amp;gt; is set, we are NOT able to figure out why &amp;amp; what creates the duplicates.&lt;/P&gt;&lt;P class=""&gt;Would appreciate any help, references, or pointers. Thanks in advance!&lt;/P&gt;</description>
    <pubDate>Tue, 19 Dec 2023 04:42:06 GMT</pubDate>
    <dc:creator>subasm</dc:creator>
    <dc:date>2023-12-19T04:42:06Z</dc:date>
    <item>
      <title>How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</link>
      <description>&lt;P&gt;We are trying to ingest a large volume (petabytes) of data into Splunk.&lt;/P&gt;&lt;P&gt;The events are in JSON files named like 'audit_events_ip-10-23-186-200_1.1512077259453.json'.&lt;/P&gt;&lt;P&gt;The pipeline is:&lt;/P&gt;&lt;P&gt;JSON files &amp;gt; Folder &amp;gt; UF &amp;gt; HF Cluster &amp;gt; Indexer Cluster&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;~ UF - inputs.conf&lt;/P&gt;&lt;P class=""&gt;[batch:///folder]&lt;/P&gt;&lt;P class=""&gt;_TCP_ROUTING = p2s_au_hf&lt;/P&gt;&lt;P class=""&gt;crcSalt = &amp;lt;SOURCE&amp;gt;&lt;/P&gt;&lt;P class=""&gt;disabled = false&lt;/P&gt;&lt;P class=""&gt;move_policy = sinkhole&lt;/P&gt;&lt;P class=""&gt;recursive = false&lt;/P&gt;&lt;P class=""&gt;whitelist = \.json$&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;We are seeing that events from specific files (NOT all of them) are getting duplicated: some files are indexed exactly twice.&lt;/P&gt;&lt;P class=""&gt;Since this is a [batch://] input, which is supposed to delete each file after reading it, and crcSalt = &amp;lt;SOURCE&amp;gt; is set, we are NOT able to figure out why &amp;amp; what creates the duplicates.&lt;/P&gt;&lt;P class=""&gt;Would appreciate any help, references, or pointers. Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 04:42:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672244#M112615</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T04:42:06Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672254#M112616</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;probably your logs are rotated into a different file at midnight, so the crcSalt option duplicates your indexed data. Have you tried without this option?&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:10:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672254#M112616</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T07:10:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672257#M112617</link>
      <description>&lt;P&gt;We are manually copying the files to the &amp;lt;DIR&amp;gt;, and from there the UF is supposed to pick them up.&lt;/P&gt;&lt;P&gt;So I don't think the same files are being rolled over at midnight.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:33:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672257#M112617</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T07:33:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672261#M112618</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;if there isn't a rotation, the data are duplicated at the origin. Anyway, if you don't use the crcSalt option you can be sure to avoid duplicates, because Splunk uses its internal archive (the fishbucket) to track data it has already ingested.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 07:38:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672261#M112618</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T07:38:07Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672306#M112622</link>
      <description>&lt;P&gt;The transfer of the source files to the folder is under our control - we have verified that the source data is NOT duplicated.&lt;/P&gt;&lt;P&gt;It seems to me there are issues while the data is in flight: UF -&amp;gt; HF -&amp;gt; Indexers.&lt;/P&gt;&lt;P&gt;Not sure how the ACK works in this setup.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 13:47:10 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672306#M112622</guid>
      <dc:creator>subasm</dc:creator>
      <dc:date>2023-12-19T13:47:10Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672309#M112625</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/263439"&gt;@subasm&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I'm quite sure that the issue is in the data.&lt;/P&gt;&lt;P&gt;Open a case with Splunk Support to be sure.&lt;/P&gt;&lt;P&gt;Ciao.&lt;/P&gt;&lt;P&gt;Giuseppe&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2023 14:05:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-duplicates/m-p/672309#M112625</guid>
      <dc:creator>gcusello</dc:creator>
      <dc:date>2023-12-19T14:05:30Z</dc:date>
    </item>
  </channel>
</rss>