I've been testing Splunk for several months now, and am consistently having problems with duplicate events appearing in the index. This sort of inconsistency is a show-stopper, so I decided to write a simple test case to investigate.
Our actual conditions are:
However, throughout the day we see a lot of duplicate entries, often on the order of 10x the original number of log entries.
Experimenting, it seemed that Splunk might act differently depending on whether the log is being overwritten or appended to.
I ran a test script for about 12 hours that continually added new lines to two log files: one by appending to a temp file and then overwriting the monitored file, the other by appending directly to the monitored file.
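For reference, the test was essentially a loop of the following shape (a simplified sketch, not the original script; the paths and iteration count are illustrative, and the real run lasted ~12 hours):

```shell
#!/bin/sh
# Sketch of the append-vs-overwrite test. Paths and count are illustrative.
DIR=/tmp/splunktest
rm -rf "$DIR" && mkdir -p "$DIR"
touch "$DIR/overwrite.log"

i=0
while [ "$i" -lt 5 ]; do
    line="$(date) test line $i"

    # Case 1: append directly to the monitored file
    echo "$line" >> "$DIR/append.log"

    # Case 2: append to a temp copy, then overwrite the monitored file
    cp "$DIR/overwrite.log" "$DIR/overwrite.tmp"
    echo "$line" >> "$DIR/overwrite.tmp"
    cp "$DIR/overwrite.tmp" "$DIR/overwrite.log"   # non-atomic overwrite

    i=$((i + 1))
done
wc -l "$DIR/append.log" "$DIR/overwrite.log"
```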
$ wc -l ~/data/test/*
  63987 /home/splunk/data/test/append.log
  63374 /home/splunk/data/test/overwrite.log
/home/splunk/data/test/overwrite.log | 8,994,387
/home/splunk/data/test/append.log    |    63,965
You can clearly see the reindexing issue in overwrite.log. The small difference in append.log is likely Splunk not having indexed some entries yet.
My settings look like this:
[monitor:///home/splunk/data/test/]
disabled = false
host = test
index = main
crcSalt = <SOURCE>
followTail = 1
(The actual log file I'm having trouble with contains a long header line, which is why I'm using crcSalt).
I noticed this in the change history for 4.1.4:
"monitor inputs using the followTail setting sometimes will index some older events or all events from log files which are updated when not intended. (SPL-23555) "
Are there any more details on what type of problems SPL-23555 fixed? I haven't seen any changes in behaviour after upgrading.
I've also been receiving these messages in my splunkd.log intermittently:
Time parsed (Mon Sep 6 00:02:16 2010) is too far away from the previous event's time (Mon Sep 6 14:55:55 2010) to be accepted.
The first date comes from the beginning of the file, the second from the end. Does this mean that Splunk is trying to scan the overwritten file before writing is completed, thus treating it as a completely different file with new entries?
Is overwriting log files an acceptable practice when using Splunk? Is this a bug of some kind?
For starters, the "followTail" option may not be working like you'd expect. You don't need that option generally, if you plan to index the whole file.
One thing you may want to consider is that your overwrite operation is not atomic. That is, there is a window of time where Splunk can see the partial contents of the overwritten file. This can lead to confusion, because the file appears smaller than it was the last time Splunk checked its size.
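The difference is easy to see locally with inodes: an in-place overwrite reuses the target's inode (and briefly truncates it), while a rename swaps in a fresh inode in one step. A small demonstration, assuming GNU coreutils `stat` on Linux:

```shell
# cp rewrites the target in place (same inode, briefly truncated);
# mv is a single rename() that swaps in a different inode atomically.
workdir=$(mktemp -d)
cd "$workdir"

echo "old contents" > target.log
inode_before=$(stat -c %i target.log)      # inode of the original file

echo "new contents" > incoming.log
cp incoming.log target.log                 # overwrite in place
inode_after_cp=$(stat -c %i target.log)    # same inode as before

echo "newer contents" > incoming2.log
mv incoming2.log target.log                # atomic rename (same filesystem)
inode_after_mv=$(stat -c %i target.log)    # different inode

echo "before=$inode_before after_cp=$inode_after_cp after_mv=$inode_after_mv"
```

Because the cp path keeps the same inode while shrinking and regrowing the file, a reader polling at the wrong moment sees a smaller file, exactly the confusing case described above.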
On a Unix system (which I assume is what you are on), the "mv" is atomic on the same filesystem, but not across filesystems. When you go to overwrite, I'd suggest something similar to this:
 In inputs.conf put a "blacklist = \.inprogress$"
 In your script that is feeding splunk:
download the file to /tmp
mv file_in_tmp /home/splunk/data/test/myfile.inprogress    # Splunk should ignore this
mv -f /home/splunk/data/test/myfile.inprogress /home/splunk/data/test/myfile
On Linux at least (and I would assume for other Unixes) the mv is an atomic operation.
This should work a lot better for what you are trying to accomplish, but a Splunk forwarder (if you have the architecture to support it) would probably be a superior alternative.
Thanks very much for your help - much appreciated.
Firstly, I've ensured everything is using mv. I've also disabled followTail.
However, I still haven't been able to successfully remove my duplicates. To investigate, I ran three tests. One overwrote the log via cp, one overwrote via mv, and the last directly appended data to the log.
Both the cp'ed and mv'ed logs produced duplicates, while the appended log seems okay. I'll keep examining this in the meantime, but I'm getting the feeling that it might be more than just cp or a followTail setting. Hopefully I'll post my findings soon.
Just to be completely sure: when you are testing with 'mv', are the source and destination on the same filesystem? The source and destination have to be on the same filesystem for mv to be atomic.
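If it helps, here's a quick way to check, as a hypothetical helper built on GNU `stat`'s `%d` (device ID) format. Note that `stat` follows symlinks, so a symlinked data directory resolves to the filesystem it actually lives on:

```shell
# Succeeds if both paths live on the same filesystem (same device ID),
# which is the condition for mv between them to be a single atomic rename().
same_filesystem() {
    [ "$(stat -c %d "$1")" -eq "$(stat -c %d "$2")" ]
}

# Example with paths like the ones in this thread (adjust to your layout):
# same_filesystem /tmp /home/splunk/data/test && echo "mv will be atomic"
```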
Blacklist should probably be blacklist = \.inprogress$ to ensure it ignores files ending in ".inprogress". I don't think the .* or the parentheses (capturing group) are necessary either.
dwaddle - how embarrassing: I had forgotten that my data directory IS in fact symlinked to another filesystem! I'm re-testing now by using a temp directory on the same filesystem as my data directory. Short of verifying the results, I'm pretty sure you've cracked it. This has been very interesting to look into - I ended up breaking out inotifywait to watch how mv operates. The difference in behaviour of mv between different filesystems and the same filesystem is plain to see, and is exactly as you describe. A good lesson learned today 🙂
Possibly there is a problem because of your filesystem type? What operating system are you on, and what is the filesystem type of /tmp? Often (Linux, Solaris), /tmp uses tmpfs rather than ufs or some other conventional filesystem, and so the semantics of operations like cp may not be atomic. Could you try a directory on a regular filesystem and see if that helps?
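Checking is straightforward with GNU df/stat (output varies by system):

```shell
# Show which filesystem backs /tmp and its type
df -T /tmp

# stat can print the filesystem type name directly
stat -f -c %T /tmp    # prints e.g. "tmpfs" on many Linux systems
```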
Have you considered using rsync with the --append option? Or, if you only have access to these files via HTTP, then perhaps wget with the --continue option? (Looks like curl has a similar option too.)
This may provide a better long term solution since you may be able to avoid repeatedly re-downloading the same content over and over again.