I've been testing Splunk for several months now, and am consistently having problems with duplicate events appearing in the index. This sort of inconsistency is a show-stopper, so I decided to write a simple test case to investigate.
Our actual conditions are:
However, throughout the day we see a lot of duplicate entries, often on the order of 10x the original number of log entries.
Experimenting, it seemed that Splunk might act differently depending on whether the log is being overwritten or appended to.
I ran a test script for about 12 hours, which continually added new lines to two log files: either by appending to a temp file and then overwriting the monitored file with it, or by appending directly to the monitored file.
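For reference, the test loop was roughly equivalent to this sketch. The paths are stand-ins (a mktemp directory replaces the real data directory), and the real run looped for ~12 hours rather than 5 iterations:

```shell
#!/bin/sh
set -e
DIR=$(mktemp -d)    # stands in for the monitored directory

i=0
while [ "$i" -lt 5 ]; do
    i=$((i + 1))
    line="test line $i"

    # case 1: append directly to the monitored file
    echo "$line" >> "$DIR/append.log"

    # case 2: append to a temp file, then overwrite the monitored file
    echo "$line" >> "$DIR/tmpfile"
    cp "$DIR/tmpfile" "$DIR/overwrite.log"
done

wc -l "$DIR/append.log" "$DIR/overwrite.log"
```

Both files end up with identical contents; the difference is only in how each new version reaches the monitored path.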
Raw files:
$ wc -l ~/data/test/*
63987 /home/splunk/data/test/append.log
63374 /home/splunk/data/test/overwrite.log
In Splunk:
/home/splunk/data/test/overwrite.log | 8,994,387
/home/splunk/data/test/append.log | 63,965
You can clearly see the reindexing issue in overwrite.log. The small difference in append.log is likely Splunk not having indexed some entries yet.
My settings look like this:
[monitor://home/splunk/data/test/]
disabled = false
host = test
index = main
crcSalt = <SOURCE>
followTail = 1
(The actual log file I'm having trouble with contains a long header line, which is why I'm using crcSalt).
I noticed this in the change history for 4.1.4:
"monitor inputs using the followTail setting sometimes will index some older events or all events from log files which are updated when not intended. (SPL-23555) "
Are there any more details on what type of problems SPL-23555 fixed? I haven't seen any changes in behaviour after upgrading.
I've also been receiving these messages in my splunkd.log intermittently:
Time parsed (Mon Sep 6 00:02:16 2010) is too far away from the previous event's time (Mon Sep 6 14:55:55 2010) to be accepted.
The first date comes from the beginning of the file, the second from the end. Does this mean that Splunk is trying to scan the overwritten file before writing is completed, thus treating it as a completely different file with new entries?
Is overwriting log files an acceptable practice when using Splunk? Is this a bug of some kind?
Many thanks,
Chris
For starters, the "followTail" option may not be working like you'd expect. You don't need that option generally, if you plan to index the whole file.
One thing you may want to consider is that your overwrite operation is not atomic. That is, there is a window of time where Splunk can see the partial contents of the overwritten file. This can lead to confusion, because the file appears smaller than it was the last time Splunk checked its size.
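A tiny shell experiment makes that window visible (nothing Splunk-specific here; a throwaway temp file stands in for the monitored log). An overwrite done with redirection or cp truncates the file first and writes afterwards, so a reader polling in between sees a shorter, partial file:

```shell
#!/bin/sh
set -e
F=$(mktemp)
printf '1\n2\n3\n' > "$F"

# An overwrite begins by truncating the file...
: > "$F"
[ ! -s "$F" ] && echo "a monitor polling now would see an empty file"

# ...and only afterwards is the new content written out.
printf 'new 1\nnew 2\n' > "$F"
```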
On a Unix system (which I assume is what you are on), the "mv" is atomic on the same filesystem, but not across filesystems. When you go to overwrite, I'd suggest something similar to this:
[1] In inputs.conf put a "blacklist = \.inprogress$"
[2] In your script that is feeding splunk:
download the file to /tmp
mv file_in_tmp /home/splunk/data/test/myfile.inprogress # Splunk should ignore
mv -f /home/splunk/data/test/myfile.inprogress /home/splunk/data/test/myfile
On Linux at least (and I would assume for other Unixes) the mv is an atomic operation.
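The hand-off above can be sketched end to end. The paths are stand-ins (a mktemp directory replaces /home/splunk/data/test so this runs anywhere), and writing a sample line stands in for the real download step; the key point is that the staging location is on the same filesystem as the destination, so the final mv is an atomic rename:

```shell
#!/bin/sh
set -e
DEST=$(mktemp -d)        # stands in for the monitored directory
TMP=$DEST/staging        # same filesystem as $DEST, so mv is a rename
mkdir "$TMP"

# [1] produce/download the new copy outside the monitored files
printf 'fresh log contents\n' > "$TMP/myfile"

# [2] move it in under a blacklisted name (Splunk ignores *.inprogress)
mv "$TMP/myfile" "$DEST/myfile.inprogress"

# [3] atomic rename into place: a reader sees either the old file or the
#     complete new one, never a half-written file
mv -f "$DEST/myfile.inprogress" "$DEST/myfile"
```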
This should work a lot better for what you are trying to accomplish, but a Splunk forwarder (if you have the architecture to support it) would probably be a superior alternative.
Have you considered using rsync with the --append option? Or, if you only have access to these files via HTTP, then perhaps wget with the --continue option? (It looks like curl has a similar option too.) This may provide a better long-term solution, since you can avoid repeatedly re-downloading the same content.
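A small local illustration of why "grow in place" matters (this doesn't invoke rsync or wget; it just shows that appending preserves the file's inode, which is what lets a tailing monitor keep its position, whereas an overwrite via rename produces a new inode):

```shell
#!/bin/sh
set -e
F=$(mktemp)

printf 'line 1\n' > "$F"
INODE1=$(ls -i "$F" | awk '{print $1}')

printf 'line 2\n' >> "$F"          # append: same file, same inode
INODE2=$(ls -i "$F" | awk '{print $1}')

[ "$INODE1" = "$INODE2" ] && echo "append kept the same inode"
```

This "append only" behaviour is what rsync --append and wget -c give you when fetching from a remote host: only the new bytes arrive, and the monitored file only ever grows.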
Also a good idea - rsync --append seems to avoid duplicates too. Thanks!
Could the problem be your filesystem type? What operating system are you on, and what is the filesystem type of /tmp? Often (on Linux and Solaris) /tmp uses tmpfs rather than ext3/ext4/xfs/ufs or some other conventional filesystem, so operations like mv and cp between it and your data directory may not be atomic. Could you try a temp directory on a regular filesystem and see if that helps?
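One quick way to check, assuming a POSIX df is available (DATA here is a stand-in for the real data directory): compare the filesystem backing /tmp with the one backing the destination. mv is only an atomic rename when the two match.

```shell
#!/bin/sh
set -e
DATA=${DATA:-$HOME}      # stand-in for /home/splunk/data/test

FS_TMP=$(df -P /tmp | awk 'NR==2 {print $1}')
FS_DATA=$(df -P "$DATA" | awk 'NR==2 {print $1}')

if [ "$FS_TMP" = "$FS_DATA" ]; then
    MSG="same filesystem: mv from /tmp is an atomic rename"
else
    MSG="different filesystems: mv degrades to copy+unlink (not atomic)"
fi
echo "$MSG"
```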
dwaddle - how embarrassing: I had forgotten that my data directory IS in fact symlinked to another filesystem! I'm re-testing now by using a temp directory on the same filesystem as my data directory. Short of verifying the results, I'm pretty sure you've cracked it. This has been very interesting to look into - I ended up breaking out inotifywait to watch how mv operates. The difference in behaviour of mv between different filesystems and the same filesystem is plain to see, and is exactly as you describe. A good lesson learned today 🙂
The blacklist should probably be blacklist = \.inprogress$ (with the dot escaped) so that it only ignores files ending in ".inprogress". I don't think the .* or parentheses (capturing group) are necessary either.
Just to be completely sure: when you are testing with 'mv', are the source and destination on the same filesystem? They have to be for mv to be atomic.
Thanks very much for your help - much appreciated.
Firstly, I've ensured everything is using mv. I've also disabled followTail.
However, I still haven't been able to successfully remove my duplicates. To investigate, I ran three tests. One overwrote the log via cp, one overwrote via mv, and the last directly appended data to the log.
Both the cp'ed and mv'ed logs produced duplicates, while the appended log seems okay. I'll keep examining this in the meantime, but I'm getting the feeling that it might be more than just cp or the followTail setting. Hopefully I'll post my findings soon.