<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to avoid indexing events twice in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</link>
    <description>Splunk Community thread: how to prevent a monitored JSON file, overwritten after each pull from an FTP server, from being re-indexed in full and duplicating events.</description>
    <pubDate>Mon, 21 Apr 2025 08:12:43 GMT</pubDate>
    <dc:creator>ws</dc:creator>
    <dc:date>2025-04-21T08:12:43Z</dc:date>
    <item>
      <title>How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm facing an issue where the same data gets indexed multiple times every time the JSON file is pulled from the FTP server.&lt;/P&gt;&lt;P&gt;Each time the JSON file is retrieved and placed on my local Splunk server, it overwrites the existing file. I don't have control over the content placed on the FTP server; it could be either an entirely new entry or an existing entry with new data appended, as shown below.&lt;/P&gt;&lt;P&gt;I'm monitoring a specific file, as its name, type, and path remain consistent.&lt;/P&gt;&lt;P&gt;From what I can observe, every time the file contains new entries alongside previously indexed data, the whole file is re-indexed, causing duplication.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;file.json&lt;/P&gt;&lt;P&gt;2024-04-21 14:00 - row 1&lt;BR /&gt;2024-04-21 14:10 - row 2&lt;/P&gt;&lt;P&gt;overwritten file.json&lt;/P&gt;&lt;P&gt;2024-04-21 14:00 - row 1&lt;BR /&gt;2024-04-21 14:10 - row 2&lt;BR /&gt;2024-04-21 14:20 - row 3&lt;/P&gt;&lt;P&gt;Additionally, I checked the sha256sum of the JSON file after it's pulled into my local Splunk server. The hash value changes before and after the file is overwritten.&lt;/P&gt;&lt;P&gt;file.json:&lt;/P&gt;&lt;P&gt;2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851 /home/ws/logs/###.json&lt;/P&gt;&lt;P&gt;overwritten file.json:&lt;/P&gt;&lt;P&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd /home/ws/logs//###.json&lt;/P&gt;&lt;P&gt;I've tried using initCrcLength, crcSalt, and followTail, but none of them prevent the duplication; Splunk still indexes the file as new data.&lt;/P&gt;&lt;P&gt;Any assistance would be appreciated, as I can't seem to prevent the duplicate indexing.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 08:12:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T08:12:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744551#M118317</link>
      <description>&lt;P&gt;This is probably because your FTP server is deleting the existing file when you overwrite it, so the forwarder sees it as a new file even if it has the same name and content. Try copying the received file, on the Splunk server, to the monitored directory.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 08:53:21 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744551#M118317</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T08:53:21Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744554#M118318</link>
      <description>&lt;P&gt;Here's what I’ve tested so far.&lt;/P&gt;&lt;P&gt;1: WinSCP uploads file.json to the FTP server → Splunk local server retrieves the file to a local directory → Splunk reads and indexes the data.&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd&lt;/P&gt;&lt;P&gt;2: Deleted file.json from the FTP server → Used WinSCP to re-upload the same file.json → Splunk local server pulled the file to the local directory → Splunk did not index the file.json&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd&lt;/P&gt;&lt;P&gt;3: WinSCP overwrote file.json on the FTP server with a version containing both new and existing entries → Splunk local server pulled the updated file to the local directory → Splunk re-read and re-indexed the entire file, including previously indexed data&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851&lt;/P&gt;&lt;P&gt;I noticed that the SHA value only changes when a new entry is added to the file, as seen in scenario 3. However, in scenarios 1 and 2, the SHA value remains the same—even if I delete and re-upload the exact same file to the FTP server and pull it into my local Splunk server.&lt;/P&gt;&lt;P&gt;And yes, I'm pulling the file from the FTP server into my local Splunk server, where the file is being monitored.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 09:47:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744554#M118318</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T09:47:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744559#M118320</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Is this "pulling the file from the FTP server into my local Splunk server" done using ftp?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If so, try pulling the file from the FTP server into a different directory on your local Splunk server first, before copying it, on the Splunk server, to the monitored directory.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 10:14:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744559#M118320</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T10:14:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744574#M118324</link>
      <description>&lt;P&gt;Yes, I'm accessing my FTP server using the FTP method. However, it shouldn't make a difference whether I'm using FTP or SFTP, right? I'm still encountering the same issue, even after copying the file to a different folder before moving it to the monitored directory on the Splunk server.&lt;/P&gt;&lt;P&gt;Just to add on, my file type is JSON.&lt;/P&gt;&lt;P&gt;[Mon Apr 21 20:28:01 +08 2025] Attempting FTP to 192.168.80.139&lt;BR /&gt;Connected to 192.168.80.139 (192.168.80.139).&lt;BR /&gt;220 (vsFTPd 3.0.3)&lt;BR /&gt;331 Please specify the password.&lt;BR /&gt;230 Login successful.&lt;BR /&gt;250 Directory successfully changed.&lt;BR /&gt;Local directory now /home/ws/pull&lt;BR /&gt;221 Goodbye.&lt;BR /&gt;'/home/ws/pull/###_case_final.json' -&amp;gt; '/home/ws/logs/###_case_final.json'&lt;BR /&gt;[Mon Apr 21 20:28:12 +08 2025] Attempting FTP to 192.168.80.139&lt;BR /&gt;Connected to 192.168.80.139 (192.168.80.139).&lt;BR /&gt;220 (vsFTPd 3.0.3)&lt;BR /&gt;331 Please specify the password.&lt;BR /&gt;230 Login successful.&lt;BR /&gt;250 Directory successfully changed.&lt;BR /&gt;Local directory now /home/ws/pull&lt;BR /&gt;local: ###_case_final.json remote: ###_case_final.json&lt;BR /&gt;227 Entering Passive Mode (192,168,80,139,249,175).&lt;BR /&gt;150 Opening BINARY mode data connection for ###_case_final.json (1455 bytes).&lt;BR /&gt;226 Transfer complete.&lt;BR /&gt;1455 bytes received in 8.5e-05 secs (17117.65 Kbytes/sec)&lt;BR /&gt;221 Goodbye.&lt;BR /&gt;'/home/ws/pull/###_case_final.json' -&amp;gt; '/home/ws/logs/###_case_final.json'&lt;/P&gt;&lt;P&gt;As of now, my inputs.conf contains only the following.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ws_0-1745238960452.png" style="width: 400px;"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/38669iF006DE2FFBC55A1F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ws_0-1745238960452.png" alt="ws_0-1745238960452.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 12:50:45 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744574#M118324</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T12:50:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744575#M118325</link>
      <description>&lt;P&gt;So, are you using (s)ftp to copy from one directory to the final directory or using the cp command (on the server where the monitored directory is)?&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 12:53:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744575#M118325</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T12:53:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744579#M118326</link>
      <description>&lt;P&gt;My original Python script accessed the FTP server directly and used the mget command to retrieve files from the FTP server straight into the monitored folder.&lt;/P&gt;&lt;P&gt;Following your suggestion to pull the file from the FTP server into a different directory on my local Splunk server first, before copying it on the Splunk server to the monitored directory, I made a slight change to the script so it only runs cp after it exits the FTP session.&lt;/P&gt;&lt;P&gt;ftp -inv "$HOST" &amp;lt;&amp;lt;EOF &amp;gt;&amp;gt; /home/ws/fetch_debug.log 2&amp;gt;&amp;amp;1&lt;BR /&gt;user $USER $PASS&lt;BR /&gt;cd $REMOTE_DIR&lt;BR /&gt;lcd /home/ws/pull&lt;BR /&gt;mget *&lt;BR /&gt;bye&lt;BR /&gt;EOF&lt;/P&gt;&lt;P&gt;cp -v /home/ws/pull/*.json /home/ws/logs &amp;gt;&amp;gt; /home/ws/fetch_debug.log 2&amp;gt;&amp;amp;1&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 13:53:00 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744579#M118326</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T13:53:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744583#M118328</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/276234"&gt;@ws&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you are using a script to do this, it might be worth trying to change the process a little bit - instead of downloading the file and overwrite the existing file, try downloading the file as a temp file, then write the contents to the existing file. This will prevent Splunk thinking it is a new file. Theres an interesting thread here&amp;nbsp;&lt;A href="https://community.splunk.com/t5/Getting-Data-In/Duplicate-indexing-of-data/m-p/376619" target="_blank"&gt;https://community.splunk.com/t5/Getting-Data-In/Duplicate-indexing-of-data/m-p/376619&lt;/A&gt;&amp;nbsp;which might help you.&lt;/P&gt;&lt;P&gt;Another thing you could do is change the logging to DEBUG for the following components:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;TailingProcessor&lt;/LI&gt;&lt;LI&gt;BatchReader&lt;/LI&gt;&lt;LI&gt;WatchedFile&lt;/LI&gt;&lt;LI&gt;FileTracker&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Then see what Splunk logs the next time you update the file.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":glowing_star:"&gt;🌟&lt;/span&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Did this answer help you?&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;If so, please consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adding karma to show it was useful&lt;/LI&gt;&lt;LI&gt;Marking it as the solution if it resolved your issue&lt;/LI&gt;&lt;LI&gt;Commenting if you need any clarification&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Your feedback encourages the volunteers in this community to continue contributing&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 15:12:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744583#M118328</guid>
      <dc:creator>livehybrid</dc:creator>
      <dc:date>2025-04-21T15:12:09Z</dc:date>
    </item>
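The temp-file idea suggested in the reply above can be sketched in Python. This is a minimal illustration, not the poster's actual script: ftplib is the standard-library FTP client, but the host, credentials, and file names in pull_via_tempfile are placeholder assumptions. The key point is that the monitored file is truncated and rewritten in place, so it keeps its inode instead of being replaced the way mv/rename would replace it.

```python
import os
import shutil
import tempfile
from ftplib import FTP


def rewrite_in_place(src_path, dst_path):
    """Truncate and rewrite dst_path with src_path's bytes.

    Opening dst_path with "wb" truncates the existing file rather than
    unlinking it, so the file keeps its inode and the monitor input
    still sees the same file.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def pull_via_tempfile(host, user, password, remote_name, monitored_path):
    """Download remote_name to a temp file, then rewrite the monitored file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with FTP(host) as ftp:
            ftp.login(user, password)
            with open(tmp_path, "wb") as tmp:
                ftp.retrbinary("RETR " + remote_name, tmp.write)
        rewrite_in_place(tmp_path, monitored_path)
    finally:
        os.remove(tmp_path)
```

Only rewrite_in_place is exercisable without a reachable FTP server; pull_via_tempfile just wires it to the download step.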
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744589#M118332</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/170906"&gt;@livehybrid&lt;/a&gt;, OK, let me test the method you mentioned: download &lt;SPAN&gt;the file as a temp file, then write the contents to the existing file.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I believe this can be handled within the same Python script, which connects to the FTP server and downloads the file to my local Splunk server.&lt;/P&gt;&lt;P&gt;Thanks for sharing the additional information. Since I'm still learning, could you advise which log file I should be checking after changing the logging to DEBUG for the following components?&lt;/P&gt;&lt;P&gt;TailingProcessor&lt;BR /&gt;BatchReader&lt;BR /&gt;WatchedFile&lt;BR /&gt;FileTracker&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 16:13:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744589#M118332</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T16:13:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744600#M118334</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/276234"&gt;@ws&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Let us know how you get on with the Python script.&lt;/P&gt;&lt;P&gt;In the meantime - the file you want to edit is: $SPLUNK_HOME/etc/log.cfg (e.g.&amp;nbsp;/opt/splunk/etc/log.cfg)&lt;/P&gt;&lt;P&gt;Looks for category.&amp;lt;key&amp;gt; and change the default (usually INFO) to DEBUG for those keys. You will need to restart Splunk. Then you should see further info in index=_internal component=&amp;lt;key&amp;gt; which *might* help!&lt;/P&gt;&lt;P&gt;This should be on the forwarder picking up the logs.&lt;/P&gt;&lt;P&gt;Dont forget to add karma/like any posts which help &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Will&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 21:00:46 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744600#M118334</guid>
      <dc:creator>livehybrid</dc:creator>
      <dc:date>2025-04-21T21:00:46Z</dc:date>
    </item>
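The log.cfg edit described in the reply above could be scripted along these lines. This is a sketch that works on configuration text in memory, nothing is written to a real install; $SPLUNK_HOME/etc/log.cfg (often /opt/splunk/etc/log.cfg) is the usual location per the reply, and Splunk still needs a restart after the real edit. The category.IndexProcessor line in the test exists only to show that untargeted keys are left alone.

```python
import re

# The four file-monitoring components named in the reply above.
COMPONENTS = ("TailingProcessor", "BatchReader", "WatchedFile", "FileTracker")


def raise_to_debug(cfg_text, components=COMPONENTS):
    """Return cfg_text with the given category.* keys switched from INFO to DEBUG.

    Only whole lines of the form category.Name=INFO are rewritten; every
    other line is preserved untouched.
    """
    pattern = re.compile(
        r"^(category\.(?:%s))=INFO$" % "|".join(components), re.MULTILINE
    )
    return pattern.sub(r"\1=DEBUG", cfg_text)
```

In practice you would read the real log.cfg, pass its text through this function, and write the result back, then restart Splunk and search index=_internal for those components.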
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744655#M118342</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/170906"&gt;@livehybrid&lt;/a&gt;, i tried the following method to write into the local file with keeping the file at /tmp but it still didn't work.&lt;/P&gt;&lt;P&gt;As for my situation, i think the best scenario would be keep a record of something like "seen before record.txt" and do a comparison and only to write new records into the file and remove previous indexed entries.&lt;/P&gt;&lt;P&gt;At least the current approach is workable, but we’ll need to monitor the file size of "seen before record.txt" as it continues to grow. For now, the file size isn’t a concern since it only stores a limited number of tracking records.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Apr 2025 12:04:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744655#M118342</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-22T12:04:51Z</dc:date>
    </item>
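The "seen before record.txt" idea in the final reply above can be sketched like this. All file names are illustrative assumptions, not the poster's actual paths. The monitored file is only ever appended to, never truncated, so its initial bytes (and the CRC Splunk computed on first read) stay stable and only the unindexed tail is picked up.

```python
import os


def append_new_records(pulled_path, monitored_path, ledger_path):
    """Append only lines not yet recorded in the ledger to the monitored file.

    Returns the list of newly appended records, and records them in the
    ledger so they are skipped on the next pull.
    """
    seen = set()
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            seen = set(line.rstrip("\n") for line in f)

    new_lines = []
    with open(pulled_path) as f:
        for line in f:
            record = line.rstrip("\n")
            if record and record not in seen:
                new_lines.append(record)

    # Append, never truncate: the monitored file keeps its existing
    # leading bytes, so the monitor input reads only the new tail.
    with open(monitored_path, "a") as out, open(ledger_path, "a") as ledger:
        for record in new_lines:
            out.write(record + "\n")
            ledger.write(record + "\n")
    return new_lines
```

As the poster notes, the ledger grows without bound, so a real deployment would need to rotate or prune it eventually.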
  </channel>
</rss>

