Getting Data In

Appending vs overwriting tailed log files

chjpcert
Explorer

I've been testing Splunk for several months now, and am consistently having problems with duplicate events appearing in the index. This sort of inconsistency is a show-stopper, so I decided to write a simple test case to investigate.

Our actual conditions are:

  • remote log file downloaded to /tmp via HTTP as a cron job
  • when downloading complete, log file in /tmp overwritten over the old log file
  • Splunk picks up the file and should only index new entries

However, throughout the day, we see a lot of duplicate entries, often in the magnitude of 10x the original number of log entries.

Experimenting, it seemed that Splunk might act differently depending on whether the log is being overwritten or appended to.

I ran a test script for about 12 hours, which continually added a new lines to two log files by either appending to a temp file and then overwriting to the monitored file, or directly appending to the monitored file.

Raw files:

$ wc -l ~/data/test/*
   63987 /home/splunk/data/test/append.log
   63374 /home/splunk/data/test/overwrite.log

In Splunk:

/home/splunk/data/test/overwrite.log | 8,994,387
/home/splunk/data/test/append.log | 63,965 

You can clearly see the reindexing issue in overwrite.log. The small difference in append.log is likely Splunk not having indexed some entries yet.

My settings look like this:

[monitor://home/splunk/data/test/]
disabled = false
host = test
index = main
crcSalt = <SOURCE>
followTail = 1

(The actual log file I'm having trouble with contains a long header line, which is why I'm using crcSalt).

I noticed this in the change history for 4.1.4:

"monitor inputs using the followTail setting sometimes will index some older events or all events from log files which are updated when not intended. (SPL-23555) "

Are there any more details on what type of problems SPL-23555 fixed? I haven't seen any changes in behaviour after upgrading.

I've also been receiving these messages in my splunkd.log intermittently:

Time parsed (Mon Sep  6 00:02:16 2010) is too far away from the previous event's time (Mon Sep  6 14:55:55 2010) to be accepted.

The first date comes from the beginning of the file, the second from the end. Does this mean that Splunk is trying to scan the overwritten file before writing is completed, thus treating it as a completely different file with new entries?

Is overwriting log files an acceptable practice when using Splunk? Is this a bug of some kind?

Many thanks,

Chris

Tags (2)
1 Solution

dwaddle
SplunkTrust
SplunkTrust

For starters, the "followTail" option may not be working like you'd expect. You don't need that option generally, if you plan to index the whole file.

One thing you may want to consider is that your overwrite operation is not atomic. That is, there is a window of time where Splunk can see the partial contents of the overwritten file. This can lead to confusion, because the file appears smaller than it was the last time Splunk checked its size.

On a Unix system (which I assume is what you are on), the "mv" is atomic on the same filesystem, but not across filesystems. When you go to overwrite, I'd suggest something similar to this:

[1] In inputs.conf put a "blacklist = \.inprogress$"

[2] In your script that is feeding splunk:

download the file to /tmp
mv file_in_tmp /home/splunk/data/test/myfile.inprogress  # Splunk should ignore
mv -f /home/splunk/data/myfile.inprogress /home/splunk/data/myfile  

On Linux at least (and I would assume for other Unixes) the mv is an atomic operation.

This should work a lot better for what you are trying to accomplish, but a Splunk forwarder (if you have the architecture to support it) would probably be a superior alternative.

View solution in original post

Lowell
Super Champion

Have you considered using rsync with the --append option? Or, if you only have access to these files via HTTP, then perhaps using wget with the --continue option? (Looks like curl has a similar option too).

This may provide a better long term solution since you may be able to avoid repeatedly re-downloading the same content over and over again.

0 Karma

chjpcert
Explorer

Also a good idea - rsync --append seems to avoid duplicates too. Thanks!

0 Karma

gkanapathy
Splunk Employee
Splunk Employee

Possibly there is problem because of your file system type? What operating system are you on, and what is the filesystem type of /tmp. Often (Linux, Solaris), /tmp uses the tmpfs rather than ext3/ext4/xfs/ufs or some other conventional filesystem, and so semantics of operations like mv and cp may not be atomic. Could you try a directory on a regular filesystem and see if that helps?

0 Karma

dwaddle
SplunkTrust
SplunkTrust

For starters, the "followTail" option may not be working like you'd expect. You don't need that option generally, if you plan to index the whole file.

One thing you may want to consider is that your overwrite operation is not atomic. That is, there is a window of time where Splunk can see the partial contents of the overwritten file. This can lead to confusion, because the file appears smaller than it was the last time Splunk checked its size.

On a Unix system (which I assume is what you are on), the "mv" is atomic on the same filesystem, but not across filesystems. When you go to overwrite, I'd suggest something similar to this:

[1] In inputs.conf put a "blacklist = \.inprogress$"

[2] In your script that is feeding splunk:

download the file to /tmp
mv file_in_tmp /home/splunk/data/test/myfile.inprogress  # Splunk should ignore
mv -f /home/splunk/data/myfile.inprogress /home/splunk/data/myfile  

On Linux at least (and I would assume for other Unixes) the mv is an atomic operation.

This should work a lot better for what you are trying to accomplish, but a Splunk forwarder (if you have the architecture to support it) would probably be a superior alternative.

chjpcert
Explorer

dwaddle - how embarrassing: I had forgotten that my data directory IS in fact symlinked to another filesystem! I'm re-testing now by using a temp directory on the same filesystem as my data directory. Short of verifying the results, I'm pretty sure you've cracked it. This has been very interesting to look into - I ended up breaking out inotifywait to watch how mv operates. The difference in behaviour of mv between different filesystems and the same filesystem is plain to see, and is exactly as you describe. A good lesson learned today 🙂

0 Karma

Jason
Motivator

Blacklist should probably be blacklist = .inprogress$ to ensure it ignores files ending in dot inprogress. I don't think the .* or parentheses (capturing group) are necessary either.

0 Karma

dwaddle
SplunkTrust
SplunkTrust

Just to be completely sure, when you are testing with 'mv', are the soruce and destination on the same filesystem? The source & destination have to be on the same filesystem for mv to be atomic.

0 Karma

chjpcert
Explorer

Thanks very much for your help - much appreciated.

Firstly, I've ensured everything is using mv. I've also disabled followTail.

However, I still haven't been able to successfully remove my duplicates. To investigate, I ran three tests. One overwrote the log via cp, one overwrote via mv, and the last directly appended data to the log.

Both the cp'ed and mv'ed logs produced duplicates, while the appended log seems okay. I'll keep examining this from the meantime, but I'm getting the feeling that it might be more than just cp or a followTail setting. Hopefully I'll post my findings soon.

0 Karma
Get Updates on the Splunk Community!

Now Available: Cisco Talos Threat Intelligence Integrations for Splunk Security Cloud ...

At .conf24, we shared that we were in the process of integrating Cisco Talos threat intelligence into Splunk ...

Preparing your Splunk Environment for OpenSSL3

The Splunk platform will transition to OpenSSL version 3 in a future release. Actions are required to prepare ...

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...