Getting Data In

Log rotation best practices

Motivator

We have a couple of files, that are rotated by copying and then truncating the original file (so no new inode is created for the monitored file). We use lagadm on Solaris but I think there will be a lot of people using logrotate on Linux using the copytruncate option for apache logs. We have noticed, that for those files it happens that we get the following error message:

06-05-2012 09:46:01.761 +0200 ERROR TailingProcessor - Ignoring path due to: failed to compute crc for /path/to/my.log (method: 0, hint: Error 0)

This error message occurs when the log rotation is in progress. If the error message is written by splunk the affected file will not be indexed from then on.

Is there a recommended way how logfiles should be rotated? I guess, that for apache using piped logs is the solution (http://httpd.apache.org/docs/2.4/logs.html#piped). But what about other programs? Anyone?

Thanks

Chris

Tags (3)
1 Solution

Motivator

Apparently there is a fix for this issue which will go into the 4.3.3 release

View solution in original post

Splunk Employee
Splunk Employee

On this specific topic, there's no real reason to use copy-truncate for apache. It can be told to reopen its logs to avoid the races.


Ironically, the recommended strategy is listed immediately above the topic linked to: http://httpd.apache.org/docs/2.4/logs.html#rotation They are using the rename + recreate pattern.

0 Karma

Splunk Employee
Splunk Employee

Log rotation best practice is to use rename-and-recreate, or time-based logfile names (which in truth is not rotation at all, but still a good practice in some cases.)

Rename-and-recreate is the most common pattern, and is used often by both external rotation programs, and by programs that self-rotate.

In this pattern, the logging application typically writes to a file with a fixed name. For example if the application is called Frog Decorator, the output might got to /var/log/frog-decorator.log or /opt/frog-decorator/logs/frog-decorator.log etc. If the application is self-rotating, it will at some point (daily, perhaps, or based on size-policy), rename frog-decorator.log to another name, eg frog-decorator.log.1, and then it will open a new frog-decorator.log file and write there.

It is possible for self-rotating programs to cause all writes to always land in the stable name, but in multithreaded applications, it is common that some writes will land in the renamed file during the rotation, because continued operation of those threads is valued over neat and clean write patterns to log files.

External rename & recreate is essentially the same, but requires a method of co-ordination between the rotation program and the logging application. After the rotation program has renamed the original log file, it must tell the original application that this has occurred, so the logging application knows it is time to re-open the 'frog-decorator.log', causing the application to create the new log, and land its log writes in that location. The typical communication path on unix is to send a signal to the application, often via the Hangup signal (SIGHUP), but to some degree it varies. Some programs provide a tool to tell them to reopen their logfile.

For the external rename and recreate it is unavoidable that some log writes will land in the renamed logfile.1, but this is not particularly harmful. However it is essential that the application be notified that it need to recreate the original file BEFORE you compress logfile.1. Remember that "compressing a log file" is actually multiple steps:

1 - Read the contents of the original file
2 - compress those contents in memory
3 - write out the compressed contents to a new file
4 - delete the original file

If you "compress" logfile.1 too early, you run the risk of hitting step 4, deleting the original, while the application is still writing to the file. This will cause data loss.

In general, compressing the first logfile can introduce races with the original app, as well as some races with Splunk. I strongly suggest waiting until the second backup, logfile.2 to begin compression.


The other high-quality pattern, with its own tradeoffs for management, is to simply name the logfiles uniquely on original creation.

For example, if the application "rotates" daily, it could create filenames by the date, such as frog-decorator-2016-05-03.log. When the day changes, it can simply create a new logfile and redirect its output to the new file.

Time-based filenames can be valuable especially on Microsoft Windows platforms, where renaming open files is a very complicated and difficult business. It is also well-behaved, like rename-and-recreate, in that the data persistence story is very simple.


The copy-truncate pattern is lower quality, because there is no way to ensure all data is retained. There is an inherent race condition between the logging application and the program performing the copy & truncate. Data can be written to the file after the copy and before the truncate. This data will be lost.

Additionally, copy-truncate requires two extra I/Os for every log-write. Every log-write will need to be later read back, and written out again by the copy operation. Therefore, this pattern will exhaust I/O resources more readily.

With Splunk specifically, copy-truncate requires handling a large number of additional edge-cases, such as encountering the copy in the process of being built (you would want us to recognize this as an already-handled file), and reading from an open logfile during truncation. The latter problem is potentially not solvable in complex situations.

For example, Splunk could be in a situation where it reads the first half of an event (more likely for large events), and then the file is truncated (reduced to zero length) before we can read the second half of an event. Should we send it on as is, potentially delivering a broken half-event to the index? Should we drop it, potentially losing the only half of the data we will ever gain access to?

In general, of course, Splunk should be well-behaved to the extent possible in the face of copy-truncate. The above specific situation was a case where we needed to improve and did so. Also, for applications which log to stderr or other applications which have no support for ever reopening their logfile, there may be no other option for file management than copy-truncate.

Splunk Employee
Splunk Employee

FWIW, I feel the answer by chris is the far more on-target one for the specific situation. This just felt like a good place to talk about these issues.

0 Karma

Motivator

Do you see a potential issue with number of unique sources if you are using the "name based on timestamp" pattern? I know one workaround is to modify the source field in props/transforms (ie, frog-decorator-2016-05-03.log -> frog-decorator.log), but that's not an ideal solution either.

0 Karma

Splunk Employee
Splunk Employee

Splunk has had scalability problems with large numbers of source values, and in the future could have them again as accounts may grow larger. For files named by day, however, we've never seen such problems. If you generate hundreds or thousands of files a minute, named by subsecond, for example, it's more likely to run into problems.

There is also the managability concern. How do you zero in on the data you want if source isn't a useful key? For many situations, host and sourcetype are sufficient, but for others you might need to rename, or use wildcards, or extract parts of the path from the source as the main informative element.

It's annoying, but you'd really rather solve those problems in Splunk, typically, than struggle with a logging app that crashes because it gets an error when it tries to open it's logfile for creation when the file was already deleted but is still open for writing. (Yes, windows remembers the NAME of files that are open but deleted, and blocks reuse of the NAME).

Though I agree, on at least UNIX, stable name + rotate is often the most convenient option for all parties.

0 Karma

Motivator

Apparently there is a fix for this issue which will go into the 4.3.3 release

View solution in original post

Motivator

So far the issue only occured when we use the "-c" option for logadm this corresponds to copytruncate option for logrotate. According to the logrotate man page the rotation is done quickly. I do not know how quickly though and have not tested this for logadm. Thanks for the hint I'll try to find out not sure how I can test this though.

0 Karma

Splunk Employee
Splunk Employee

Most people use logrotate and this usually works just fine. Is the new truncated file empty for a period of time?

0 Karma