Log rotation best practice is to use rename-and-recreate, or time-based logfile names (which in truth is not rotation at all, but still a good practice in some cases.)
Rename-and-recreate is the most common pattern, and is used often by both external rotation programs, and by programs that self-rotate.
In this pattern, the logging application typically writes to a file with a fixed name. For example if the application is called Frog Decorator, the output might got to /var/log/frog-decorator.log or /opt/frog-decorator/logs/frog-decorator.log etc. If the application is self-rotating, it will at some point (daily, perhaps, or based on size-policy), rename frog-decorator.log to another name, eg frog-decorator.log.1, and then it will open a new frog-decorator.log file and write there.
It is possible for self-rotating programs to cause all writes to always land in the stable name, but in multithreaded applications, it is common that some writes will land in the renamed file during the rotation, because continued operation of those threads is valued over neat and clean write patterns to log files.
External rename & recreate is essentially the same, but requires a method of co-ordination between the rotation program and the logging application. After the rotation program has renamed the original log file, it must tell the original application that this has occurred, so the logging application knows it is time to re-open the 'frog-decorator.log', causing the application to create the new log, and land its log writes in that location. The typical communication path on unix is to send a signal to the application, often via the Hangup signal (SIGHUP), but to some degree it varies. Some programs provide a tool to tell them to reopen their logfile.
For the external rename and recreate it is unavoidable that some log writes will land in the renamed logfile.1, but this is not particularly harmful. However it is essential that the application be notified that it need to recreate the original file BEFORE you compress logfile.1. Remember that "compressing a log file" is actually multiple steps:
1 - Read the contents of the original file
2 - compress those contents in memory
3 - write out the compressed contents to a new file
4 - delete the original file
If you "compress" logfile.1 too early, you run the risk of hitting step 4, deleting the original, while the application is still writing to the file. This will cause data loss.
In general, compressing the first logfile can introduce races with the original app, as well as some races with Splunk. I strongly suggest waiting until the second backup, logfile.2 to begin compression.
The other high-quality pattern, with its own tradeoffs for management, is to simply name the logfiles uniquely on original creation.
For example, if the application "rotates" daily, it could create filenames by the date, such as frog-decorator-2016-05-03.log . When the day changes, it can simply create a new logfile and redirect its output to the new file.
Time-based filenames can be valuable especially on Microsoft Windows platforms, where renaming open files is a very complicated and difficult business. It is also well-behaved, like rename-and-recreate, in that the data persistence story is very simple.
The copy-truncate pattern is lower quality, because there is no way to ensure all data is retained. There is an inherent race condition between the logging application and the program performing the copy & truncate. Data can be written to the file after the copy and before the truncate. This data will be lost.
Additionally, copy-truncate requires two extra I/Os for every log-write. Every log-write will need to be later read back, and written out again by the copy operation. Therefore, this pattern will exhaust I/O resources more readily.
With Splunk specifically, copy-truncate requires handling a large number of additional edge-cases, such as encountering the copy in the process of being built (you would want us to recognize this as an already-handled file), and reading from an open logfile during truncation. The latter problem is potentially not solvable in complex situations.
For example, Splunk could be in a situation where it reads the first half of an event (more likely for large events), and then the file is truncated (reduced to zero length) before we can read the second half of an event. Should we send it on as is, potentially delivering a broken half-event to the index? Should we drop it, potentially losing the only half of the data we will ever gain access to?
In general, of course, Splunk should be well-behaved to the extent possible in the face of copy-truncate. The above specific situation was a case where we needed to improve and did so. Also, for applications which log to stderr or other applications which have no support for ever reopening their logfile, there may be no other option for file management than copy-truncate.
... View more