<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to avoid indexing events twice in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</link>
    <description>Splunk Community thread: how to prevent a monitored JSON file, overwritten after each pull from an FTP server, from being re-indexed in full and duplicating events.</description>
    <pubDate>Mon, 21 Apr 2025 08:12:43 GMT</pubDate>
    <dc:creator>ws</dc:creator>
    <dc:date>2025-04-21T08:12:43Z</dc:date>
    <item>
      <title>How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm facing an issue where the same data gets indexed multiple times every time the JSON file is pulled from the FTP server.&lt;/P&gt;&lt;P&gt;Each time the JSON file is retrieved and placed on my local Splunk server, it overwrites the existing file. I don't have control over the content placed on the FTP server; it could be either an entirely new entry or an existing entry with new data appended, as shown below.&lt;/P&gt;&lt;P&gt;I'm monitoring a specific file, as its name, type, and path remain consistent.&lt;/P&gt;&lt;P&gt;From what I can observe, every time the file contains new entries alongside previously indexed data, the whole file is re-indexed, causing duplication.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;file.json&lt;/P&gt;&lt;P&gt;2024-04-21 14:00 - row 1&lt;BR /&gt;2024-04-21 14:10 - row 2&lt;/P&gt;&lt;P&gt;overwritten file.json&lt;/P&gt;&lt;P&gt;2024-04-21 14:00 - row 1&lt;BR /&gt;2024-04-21 14:10 - row 2&lt;BR /&gt;2024-04-21 14:20 - row 3&lt;/P&gt;&lt;P&gt;Additionally, I checked the sha256sum of the JSON file after it's pulled into my local Splunk server. The hash value changes before and after the file is overwritten.&lt;/P&gt;&lt;P&gt;file.json:&lt;/P&gt;&lt;P&gt;2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851 /home/ws/logs/###.json&lt;/P&gt;&lt;P&gt;overwritten file.json:&lt;/P&gt;&lt;P&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd /home/ws/logs//###.json&lt;/P&gt;&lt;P&gt;I've tried using initCrcLength, crcSalt, and followTail, but none of them prevent the duplication; Splunk still indexes the file as new data.&lt;/P&gt;&lt;P&gt;Any assistance would be appreciated, as I can't seem to prevent the duplicate indexing.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 08:12:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744549#M118315</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T08:12:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744551#M118317</link>
      <description>&lt;P&gt;This is probably because your FTP server is deleting the existing file when you overwrite it, so the forwarder sees it as a new file even if it has the same name and content. Try copying the received file, on the Splunk server, to the monitored directory.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 08:53:21 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744551#M118317</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T08:53:21Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744554#M118318</link>
      <description>&lt;P&gt;Here's what I’ve tested so far.&lt;/P&gt;&lt;P&gt;1: WinSCP uploads file.json to the FTP server → Splunk local server retrieves the file to a local directory → Splunk reads and indexes the data.&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd&lt;/P&gt;&lt;P&gt;2: Deleted file.json from the FTP server → Used WinSCP to re-upload the same file.json → Splunk local server pulled the file to the local directory → Splunk did not index the file.json&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd&lt;/P&gt;&lt;P&gt;3: WinSCP overwrote file.json on the FTP server with a version containing both new and existing entries → Splunk local server pulled the updated file to the local directory → Splunk re-read and re-indexed the entire file, including previously indexed data&lt;/P&gt;&lt;P&gt;sha256sum /splunk_local/file.json&lt;BR /&gt;2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851&lt;/P&gt;&lt;P&gt;I noticed that the SHA value only changes when a new entry is added to the file, as seen in scenario 3. However, in scenarios 1 and 2, the SHA value remains the same—even if I delete and re-upload the exact same file to the FTP server and pull it into my local Splunk server.&lt;/P&gt;&lt;P&gt;And yes, I'm pulling the file from the FTP server into my local Splunk server, where the file is being monitored.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 09:47:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744554#M118318</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T09:47:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744559#M118320</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Is this "pulling the file from the FTP server into my local Splunk server" done using ftp?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If so, try pulling the file from the FTP server into a different directory on your local Splunk server first, before copying it, on the Splunk server, to the monitored directory.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 10:14:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744559#M118320</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T10:14:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744574#M118324</link>
      <description>&lt;P&gt;Yes, I'm accessing my FTP server using the FTP method. However, it shouldn't make a difference whether I'm using FTP or SFTP, right? I'm still encountering the same issue, even after copying the file to a different folder before moving it to the monitored directory on the Splunk server.&lt;/P&gt;&lt;P&gt;Just to add on, my file type is JSON.&lt;/P&gt;&lt;P&gt;[Mon Apr 21 20:28:01 +08 2025] Attempting FTP to 192.168.80.139&lt;BR /&gt;Connected to 192.168.80.139 (192.168.80.139).&lt;BR /&gt;220 (vsFTPd 3.0.3)&lt;BR /&gt;331 Please specify the password.&lt;BR /&gt;230 Login successful.&lt;BR /&gt;250 Directory successfully changed.&lt;BR /&gt;Local directory now /home/ws/pull&lt;BR /&gt;221 Goodbye.&lt;BR /&gt;'/home/ws/pull/###_case_final.json' -&amp;gt; '/home/ws/logs/###_case_final.json'&lt;BR /&gt;[Mon Apr 21 20:28:12 +08 2025] Attempting FTP to 192.168.80.139&lt;BR /&gt;Connected to 192.168.80.139 (192.168.80.139).&lt;BR /&gt;220 (vsFTPd 3.0.3)&lt;BR /&gt;331 Please specify the password.&lt;BR /&gt;230 Login successful.&lt;BR /&gt;250 Directory successfully changed.&lt;BR /&gt;Local directory now /home/ws/pull&lt;BR /&gt;local: ###_case_final.json remote: ###_case_final.json&lt;BR /&gt;227 Entering Passive Mode (192,168,80,139,249,175).&lt;BR /&gt;150 Opening BINARY mode data connection for ###_case_final.json (1455 bytes).&lt;BR /&gt;226 Transfer complete.&lt;BR /&gt;1455 bytes received in 8.5e-05 secs (17117.65 Kbytes/sec)&lt;BR /&gt;221 Goodbye.&lt;BR /&gt;'/home/ws/pull/###_case_final.json' -&amp;gt; '/home/ws/logs/###_case_final.json'&lt;/P&gt;&lt;P&gt;As of now, my inputs.conf contains only the following.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ws_0-1745238960452.png" style="width: 400px;"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/38669iF006DE2FFBC55A1F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ws_0-1745238960452.png" alt="ws_0-1745238960452.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 12:50:45 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744574#M118324</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T12:50:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744575#M118325</link>
      <description>&lt;P&gt;So, are you using (s)ftp to copy from one directory to the final directory or using the cp command (on the server where the monitored directory is)?&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 12:53:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744575#M118325</guid>
      <dc:creator>ITWhisperer</dc:creator>
      <dc:date>2025-04-21T12:53:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744579#M118326</link>
      <description>&lt;P&gt;My original Python script accessed the FTP server directly and used the mget command to retrieve files from the FTP server straight into the monitored folder.&lt;/P&gt;&lt;P&gt;Following your suggestion to pull the file from the FTP server into a different directory on my local Splunk server first, before copying it on the Splunk server to the monitored directory, I made a slight change to the script so it only runs cp after it exits the FTP session.&lt;/P&gt;&lt;P&gt;ftp -inv "$HOST" &amp;lt;&amp;lt;EOF &amp;gt;&amp;gt; /home/ws/fetch_debug.log 2&amp;gt;&amp;amp;1&lt;BR /&gt;user $USER $PASS&lt;BR /&gt;cd $REMOTE_DIR&lt;BR /&gt;lcd /home/ws/pull&lt;BR /&gt;mget *&lt;BR /&gt;bye&lt;BR /&gt;EOF&lt;/P&gt;&lt;P&gt;cp -v /home/ws/pull/*.json /home/ws/logs &amp;gt;&amp;gt; /home/ws/fetch_debug.log 2&amp;gt;&amp;amp;1&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 13:53:00 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744579#M118326</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T13:53:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744583#M118328</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/276234"&gt;@ws&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you are using a script to do this, it might be worth trying to change the process a little bit - instead of downloading the file and overwrite the existing file, try downloading the file as a temp file, then write the contents to the existing file. This will prevent Splunk thinking it is a new file. Theres an interesting thread here&amp;nbsp;&lt;A href="https://community.splunk.com/t5/Getting-Data-In/Duplicate-indexing-of-data/m-p/376619" target="_blank"&gt;https://community.splunk.com/t5/Getting-Data-In/Duplicate-indexing-of-data/m-p/376619&lt;/A&gt;&amp;nbsp;which might help you.&lt;/P&gt;&lt;P&gt;Another thing you could do is change the logging to DEBUG for the following components:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;TailingProcessor&lt;/LI&gt;&lt;LI&gt;BatchReader&lt;/LI&gt;&lt;LI&gt;WatchedFile&lt;/LI&gt;&lt;LI&gt;FileTracker&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Then see what Splunk logs the next time you update the file.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":glowing_star:"&gt;🌟&lt;/span&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Did this answer help you?&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;If so, please consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adding karma to show it was useful&lt;/LI&gt;&lt;LI&gt;Marking it as the solution if it resolved your issue&lt;/LI&gt;&lt;LI&gt;Commenting if you need any clarification&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Your feedback encourages the volunteers in this community to continue contributing&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 15:12:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744583#M118328</guid>
      <dc:creator>livehybrid</dc:creator>
      <dc:date>2025-04-21T15:12:09Z</dc:date>
    </item>
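The temp-file idea suggested in the reply above can be sketched in Python. This is a minimal illustration, not the poster's actual script: ftplib is the standard-library FTP client, but the host, credentials, and file names in pull_via_tempfile are placeholder assumptions. The key point is that the monitored file is truncated and rewritten in place, so it keeps its inode instead of being replaced the way mv/rename would replace it.

```python
import os
import shutil
import tempfile
from ftplib import FTP


def rewrite_in_place(src_path, dst_path):
    """Truncate and rewrite dst_path with src_path's bytes.

    Opening dst_path with "wb" truncates the existing file rather than
    unlinking it, so the file keeps its inode and the monitor input
    still sees the same file.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def pull_via_tempfile(host, user, password, remote_name, monitored_path):
    """Download remote_name to a temp file, then rewrite the monitored file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with FTP(host) as ftp:
            ftp.login(user, password)
            with open(tmp_path, "wb") as tmp:
                ftp.retrbinary("RETR " + remote_name, tmp.write)
        rewrite_in_place(tmp_path, monitored_path)
    finally:
        os.remove(tmp_path)
```

Only rewrite_in_place is exercisable without a reachable FTP server; pull_via_tempfile just wires it to the download step.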
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744589#M118332</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/170906"&gt;@livehybrid&lt;/a&gt;, OK, let me test the method you mentioned: download &lt;SPAN&gt;the file as a temp file, then write the contents to the existing file.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I believe this can be handled within the same Python script, which connects to the FTP server and downloads the file to my local Splunk server.&lt;/P&gt;&lt;P&gt;Thanks for sharing the additional information. Since I'm still learning, could you advise which log file I should be checking after changing the logging to DEBUG for the following components?&lt;/P&gt;&lt;P&gt;TailingProcessor&lt;BR /&gt;BatchReader&lt;BR /&gt;WatchedFile&lt;BR /&gt;FileTracker&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 16:13:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744589#M118332</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-21T16:13:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744600#M118334</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/276234"&gt;@ws&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Let us know how you get on with the Python script.&lt;/P&gt;&lt;P&gt;In the meantime - the file you want to edit is: $SPLUNK_HOME/etc/log.cfg (e.g.&amp;nbsp;/opt/splunk/etc/log.cfg)&lt;/P&gt;&lt;P&gt;Looks for category.&amp;lt;key&amp;gt; and change the default (usually INFO) to DEBUG for those keys. You will need to restart Splunk. Then you should see further info in index=_internal component=&amp;lt;key&amp;gt; which *might* help!&lt;/P&gt;&lt;P&gt;This should be on the forwarder picking up the logs.&lt;/P&gt;&lt;P&gt;Dont forget to add karma/like any posts which help &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Will&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 21:00:46 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744600#M118334</guid>
      <dc:creator>livehybrid</dc:creator>
      <dc:date>2025-04-21T21:00:46Z</dc:date>
    </item>
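The log.cfg edit described in the reply above could be scripted along these lines. This is a sketch that works on configuration text in memory, nothing is written to a real install; $SPLUNK_HOME/etc/log.cfg (often /opt/splunk/etc/log.cfg) is the usual location per the reply, and Splunk still needs a restart after the real edit. The category.IndexProcessor line in the test exists only to show that untargeted keys are left alone.

```python
import re

# The four file-monitoring components named in the reply above.
COMPONENTS = ("TailingProcessor", "BatchReader", "WatchedFile", "FileTracker")


def raise_to_debug(cfg_text, components=COMPONENTS):
    """Return cfg_text with the given category.* keys switched from INFO to DEBUG.

    Only whole lines of the form category.Name=INFO are rewritten; every
    other line is preserved untouched.
    """
    pattern = re.compile(
        r"^(category\.(?:%s))=INFO$" % "|".join(components), re.MULTILINE
    )
    return pattern.sub(r"\1=DEBUG", cfg_text)
```

In practice you would read the real log.cfg, pass its text through this function, and write the result back, then restart Splunk and search index=_internal for those components.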
    <item>
      <title>Re: How to avoid indexing events twice</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744655#M118342</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/170906"&gt;@livehybrid&lt;/a&gt;, i tried the following method to write into the local file with keeping the file at /tmp but it still didn't work.&lt;/P&gt;&lt;P&gt;As for my situation, i think the best scenario would be keep a record of something like "seen before record.txt" and do a comparison and only to write new records into the file and remove previous indexed entries.&lt;/P&gt;&lt;P&gt;At least the current approach is workable, but we’ll need to monitor the file size of "seen before record.txt" as it continues to grow. For now, the file size isn’t a concern since it only stores a limited number of tracking records.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Apr 2025 12:04:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/How-to-avoid-indexing-events-twice/m-p/744655#M118342</guid>
      <dc:creator>ws</dc:creator>
      <dc:date>2025-04-22T12:04:51Z</dc:date>
    </item>
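The "seen before record.txt" idea in the final reply above can be sketched like this. All file names are illustrative assumptions, not the poster's actual paths. The monitored file is only ever appended to, never truncated, so its initial bytes (and the CRC Splunk computed on first read) stay stable and only the unindexed tail is picked up.

```python
import os


def append_new_records(pulled_path, monitored_path, ledger_path):
    """Append only lines not yet recorded in the ledger to the monitored file.

    Returns the list of newly appended records, and records them in the
    ledger so they are skipped on the next pull.
    """
    seen = set()
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            seen = set(line.rstrip("\n") for line in f)

    new_lines = []
    with open(pulled_path) as f:
        for line in f:
            record = line.rstrip("\n")
            if record and record not in seen:
                new_lines.append(record)

    # Append, never truncate: the monitored file keeps its existing
    # leading bytes, so the monitor input reads only the new tail.
    with open(monitored_path, "a") as out, open(ledger_path, "a") as ledger:
        for record in new_lines:
            out.write(record + "\n")
            ledger.write(record + "\n")
    return new_lines
```

As the poster notes, the ledger grows without bound, so a real deployment would need to rotate or prune it eventually.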
  </channel>
</rss>

