Hi,
I'm facing an issue where the same data gets indexed multiple times every time the JSON file is pulled from the FTP server.
Each time the JSON file is retrieved and placed on my local Splunk server, it overwrites the existing file. I don't have control over the content being placed on the FTP server; it could be either an entirely new entry or an existing entry with new data appended, as shown below.
I'm monitoring a specific file, as its name, type, and path remain consistent.
From what I can observe, every time the file has new entries alongside previously indexed data, the entire file is re-indexed, causing duplication.
Example:
file.json
2024-04-21 14:00 - row 1
2024-04-21 14:10 - row 2
overwritten file.json
2024-04-21 14:00 - row 1
2024-04-21 14:10 - row 2
2024-04-21 14:20 - row 3
Additionally, I checked the sha256sum of the JSON file after it's pulled into my local Splunk server. The hash value differs before and after the file is overwritten.
file.json:
2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851 /home/ws/logs/###.json
overwritten file.json:
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd /home/ws/logs//###.json
I've tried using initCrcLength, crcSalt, and followTail, but they don't seem to prevent the duplication, as Splunk still indexes it as new data.
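For reference, I set them along these lines (an illustrative stanza, not my exact inputs.conf):

[monitor:///home/ws/logs/file.json]
initCrcLength = 1024
crcSalt = <SOURCE>
followTail = 1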
Any assistance would be appreciated, as I can't seem to prevent the duplication in indexing.
Hi @ws
If you are using a script to do this, it might be worth changing the process a little bit: instead of downloading the file and overwriting the existing file, try downloading it as a temp file, then writing its contents into the existing file. This will prevent Splunk from thinking it is a new file. There's an interesting thread here https://community.splunk.com/t5/Getting-Data-In/Duplicate-indexing-of-data/m-p/376619 which might help you.
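Something like this rough, untested Python sketch shows the idea (host, credentials, and paths are placeholders, not your actual values):

import ftplib
import os
import shutil
import tempfile

FTP_HOST = "ftp.example.com"            # placeholder host
REMOTE_FILE = "file.json"               # placeholder remote file name
LOCAL_FILE = "/home/ws/logs/file.json"  # the file Splunk monitors

# Download the remote file into a temporary file first
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp_path = tmp.name
    with ftplib.FTP(FTP_HOST) as ftp:
        ftp.login("user", "pass")  # placeholder credentials
        ftp.retrbinary("RETR " + REMOTE_FILE, tmp.write)

# Rewrite the existing monitored file in place: copyfile truncates and
# rewrites the destination, so the file keeps its inode instead of
# being replaced by a new one
shutil.copyfile(tmp_path, LOCAL_FILE)
os.remove(tmp_path)

From the shell, the equivalent is cat temp.json > file.json rather than mv temp.json file.json, since mv swaps in a new inode.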
Another thing you could do is change the logging to DEBUG for the following components:
TailingProcessor
BatchReader
WatchedFile
FileTracker
Then see what Splunk logs the next time you update the file.
@livehybrid, OK, let me test out the method you mentioned: download the file as a temp file, then write the contents to the existing file.
I believe this can be handled within the same Python script, which connects to the FTP server and downloads the file to my local Splunk server.
Thanks for sharing the additional information. Since I'm still learning, could you advise which log file I should be checking after enabling DEBUG for the following?
Change the logging to DEBUG for the following components:
TailingProcessor
BatchReader
WatchedFile
FileTracker
Hi @ws
Let us know how you get on with the Python script.
In the meantime - the file you want to edit is: $SPLUNK_HOME/etc/log.cfg (e.g. /opt/splunk/etc/log.cfg)
Look for category.<key> and change the default (usually INFO) to DEBUG for those keys. You will need to restart Splunk. Then you should see further info in index=_internal component=<key>, which *might* help!
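For example, the changed lines in log.cfg would look something like this (assuming those category keys are present in your version; check the file before editing):

category.TailingProcessor=DEBUG
category.BatchReader=DEBUG
category.WatchedFile=DEBUG
category.FileTracker=DEBUG

After the restart, a search such as index=_internal component=TailingProcessor should show the extra detail.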
This should be on the forwarder picking up the logs.
Don't forget to add karma/like to any posts which help 🙂
Thanks
Will
Hi @livehybrid, I tried the suggested method of downloading to a temp file under /tmp and then writing its contents into the local file, but it still didn't work.
As for my situation, I think the best approach would be to keep a record in something like "seen before record.txt", compare against it, and write only new records into the file, dropping previously indexed entries.
At least the current approach is workable, but we’ll need to monitor the file size of "seen before record.txt" as it continues to grow. For now, the file size isn’t a concern since it only stores a limited number of tracking records.
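Something like this rough sketch captures the idea (paths and file names are examples, not my exact script):

import os

PULLED = "/home/ws/pull/file.json"        # file just fetched from FTP
MONITORED = "/home/ws/logs/file.json"     # file Splunk monitors
SEEN = "/home/ws/seen_before_record.txt"  # tracking file

# Load every record we have already indexed
seen = set()
if os.path.exists(SEEN):
    with open(SEEN) as f:
        seen = set(line.rstrip("\n") for line in f)

# Append only unseen records to the monitored file, and remember them
with open(PULLED) as src, open(MONITORED, "a") as dst, open(SEEN, "a") as track:
    for line in src:
        record = line.rstrip("\n")
        if record and record not in seen:
            dst.write(record + "\n")  # only new records reach Splunk
            track.write(record + "\n")
            seen.add(record)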
This is probably because your FTP server is deleting the existing file when you overwrite it, so the forwarder sees it as a new file even if it has the same name and content. Try copying the file received from the FTP server to the monitored directory instead.
Here's what I’ve tested so far.
1: WinSCP uploads file.json to the FTP server → Splunk local server retrieves the file to a local directory → Splunk reads and indexes the data.
sha256sum /splunk_local/file.json
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd
2: Deleted file.json from the FTP server → Used WinSCP to re-upload the same file.json → Splunk local server pulled the file to the local directory → Splunk did not re-index file.json
sha256sum /splunk_local/file.json
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd
3: WinSCP overwrote file.json on the FTP server with a version containing both new and existing entries → Splunk local server pulled the updated file to the local directory → Splunk re-read and re-indexed the entire file, including previously indexed data
sha256sum /splunk_local/file.json
2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851
I noticed that the SHA value only changes when a new entry is added to the file, as seen in scenario 3. However, in scenarios 1 and 2, the SHA value remains the same—even if I delete and re-upload the exact same file to the FTP server and pull it into my local Splunk server.
And yes, I'm pulling the file from the FTP server into my local Splunk server, where the file is being monitored.
Is this "pulling the file from the FTP server into my local Splunk server" using ftp?
If so, try pulling the file from the FTP server into a different directory on your local Splunk server, before copying it on the Splunk server to the monitored directory.
Yes, I'm accessing my FTP server using FTP. However, this shouldn't make a difference whether I'm using FTP or SFTP, right? I'm still encountering the same issue, even after copying the file to a different folder before moving it to the monitored directory on the Splunk server.
Just to add on, my file type is JSON.
[Mon Apr 21 20:28:01 +08 2025] Attempting FTP to 192.168.80.139
Connected to 192.168.80.139 (192.168.80.139).
220 (vsFTPd 3.0.3)
331 Please specify the password.
230 Login successful.
250 Directory successfully changed.
Local directory now /home/ws/pull
221 Goodbye.
'/home/ws/pull/###_case_final.json' -> '/home/ws/logs/###_case_final.json'
[Mon Apr 21 20:28:12 +08 2025] Attempting FTP to 192.168.80.139
Connected to 192.168.80.139 (192.168.80.139).
220 (vsFTPd 3.0.3)
331 Please specify the password.
230 Login successful.
250 Directory successfully changed.
Local directory now /home/ws/pull
local: ###_case_final.json remote: ###_case_final.json
227 Entering Passive Mode (192,168,80,139,249,175).
150 Opening BINARY mode data connection for ###_case_final.json (1455 bytes).
226 Transfer complete.
1455 bytes received in 8.5e-05 secs (17117.65 Kbytes/sec)
221 Goodbye.
'/home/ws/pull/###_case_final.json' -> '/home/ws/logs/###_case_final.json'
As of now, my inputs.conf contains only the following.
So, are you using (s)ftp to copy from one directory to the final directory or using the cp command (on the server where the monitored directory is)?
My original Python script accessed the FTP server directly and used the mget command to retrieve files straight into the monitored folder.
But as you suggested, I'm now pulling the file from the FTP server into a different directory on my local Splunk server, before copying it on the Splunk server to the monitored directory.
I made a slight change to the script so that it only runs cp after exiting the FTP session.
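# HOST, USER, PASS and REMOTE_DIR are assumed to be set earlier in the script (values omitted here)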
ftp -inv "$HOST" <<EOF >> /home/ws/fetch_debug.log 2>&1
user $USER $PASS
cd $REMOTE_DIR
lcd /home/ws/pull
mget *
bye
EOF
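# cp overwrites the destination file in place (same inode); mv would replace it with a new inode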
cp -v /home/ws/pull/*.json /home/ws/logs >> /home/ws/fetch_debug.log 2>&1