Knowledge Management

About log rotation best practices

yutaka1005
Builder

I am planning to rotate the logs that Splunk is monitoring.

I also plan to use the compress option when rotating.

But one thing worries me:
If rotation runs and the files are compressed before Splunk has finished reading the logs, will events be missing in Splunk?

Is there also a recommended rotate configuration?

1 Solution

inventsekar
SplunkTrust

from a similar post -

Log rotation best practice is to use rename-and-recreate, or time-based logfile names (which, in truth, are not rotation at all, but still a good practice in some cases).

Rename-and-recreate is the most common pattern, and is used often by both external rotation programs, and by programs that self-rotate.

In this pattern, the logging application typically writes to a file with a fixed name. For example, if the application is called Frog Decorator, the output might go to /var/log/frog-decorator.log or /opt/frog-decorator/logs/frog-decorator.log, etc. If the application is self-rotating, it will at some point (daily, perhaps, or based on a size policy) rename frog-decorator.log to another name, e.g. frog-decorator.log.1, and then it will open a new frog-decorator.log file and write there.

It is possible for self-rotating programs to cause all writes to always land in the stable name, but in multithreaded applications, it is common that some writes will land in the renamed file during the rotation, because continued operation of those threads is valued over neat and clean write patterns to log files.

External rename-and-recreate is essentially the same, but requires a method of coordination between the rotation program and the logging application. After the rotation program has renamed the original log file, it must tell the logging application that this has occurred, so the application knows it is time to re-open 'frog-decorator.log', causing it to create the new log and land its log writes in that location. The typical communication path on Unix is to send a signal to the application, often the hangup signal (SIGHUP), but to some degree it varies. Some programs provide a tool to tell them to reopen their logfile.
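As a minimal sketch, this coordination often looks like the following logrotate stanza. The path, pidfile location, and signal here are assumptions built on the hypothetical frog-decorator example above; check your application's documentation for its actual reopen mechanism:

    /var/log/frog-decorator.log {
        daily
        rotate 7
        missingok
        notifempty
        postrotate
            # After the rename, tell the (hypothetical) app to reopen its logfile.
            # Assumes it writes a pidfile and reopens its log on SIGHUP.
            /bin/kill -HUP "$(cat /var/run/frog-decorator.pid)" 2>/dev/null || true
        endscript
    }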

With external rename-and-recreate it is unavoidable that some log writes will land in the renamed logfile.1, but this is not particularly harmful. However, it is essential that the application be notified that it needs to recreate the original file BEFORE you compress logfile.1. Remember that "compressing a log file" is actually multiple steps:

1. Read the contents of the original file
2. Compress those contents in memory
3. Write out the compressed contents to a new file
4. Delete the original file

If you "compress" logfile.1 too early, you run the risk of hitting step 4, deleting the original, while the application is still writing to the file. This will cause data loss.

In general, compressing the first logfile can introduce races with the original app, as well as some races with Splunk. I strongly suggest waiting until the second backup, logfile.2, to begin compression.
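In logrotate terms, this is what the delaycompress option gives you: the newest rotated file is left uncompressed for one cycle and only compressed on the following rotation. A minimal sketch, using the hypothetical path from above:

    /var/log/frog-decorator.log {
        daily
        rotate 7
        compress
        delaycompress   # logfile.1 stays uncompressed; it is compressed on the next rotation
        missingok
        notifempty
    }

This also gives Splunk a full rotation interval to finish reading the renamed file before it is compressed, which addresses the original concern in this thread.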

The other high-quality pattern, with its own tradeoffs for management, is to simply name the logfiles uniquely on original creation.

For example, if the application "rotates" daily, it could create filenames by the date, such as frog-decorator-2016-05-03.log. When the day changes, it can simply create a new logfile and redirect its output to the new file.

Time-based filenames can be valuable especially on Microsoft Windows platforms, where renaming open files is a very complicated and difficult business. It is also well-behaved, like rename-and-recreate, in that the data persistence story is very simple.
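logrotate cannot make an application create uniquely-named files itself, but its dateext option gives the rotated copies date-stamped names, which buys similar uniqueness for the archived files. A hedged sketch, again with the hypothetical path:

    /var/log/frog-decorator.log {
        daily
        rotate 30
        dateext         # rotated copies become frog-decorator.log-YYYYMMDD
        compress
        delaycompress
    }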

The copy-truncate pattern is lower quality, because there is no way to ensure all data is retained. There is an inherent race condition between the logging application and the program performing the copy & truncate. Data can be written to the file after the copy and before the truncate. This data will be lost.
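For reference, this is the pattern behind logrotate's copytruncate directive; a sketch of what such a stanza looks like, once more with the hypothetical path:

    /var/log/frog-decorator.log {
        daily
        rotate 7
        copytruncate    # copy the live file, then truncate it in place; any writes
                        # that land between the copy and the truncate are lost
        compress
        delaycompress
    }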

Additionally, copy-truncate requires two extra I/Os for every log-write. Every log-write will need to be later read back, and written out again by the copy operation. Therefore, this pattern will exhaust I/O resources more readily.

With Splunk specifically, copy-truncate requires handling a large number of additional edge-cases, such as encountering the copy in the process of being built (you would want us to recognize this as an already-handled file), and reading from an open logfile during truncation. The latter problem is potentially not solvable in complex situations.

For example, Splunk could be in a situation where it reads the first half of an event (more likely for large events), and then the file is truncated (reduced to zero length) before we can read the second half. Should we send it on as-is, potentially delivering a broken half-event to the index? Should we drop it, potentially discarding the only portion of that data we will ever have access to?

In general, of course, Splunk should be well-behaved to the extent possible in the face of copy-truncate. The above specific situation was a case where we needed to improve and did so. Also, for applications which log to stderr or other applications which have no support for ever reopening their logfile, there may be no other option for file management than copy-truncate.

Splunk has had scalability problems with large numbers of source values, and could have them again in the future as data volumes grow. For files named by day, however, we've never seen such problems. If you generate hundreds or thousands of files a minute (named by subsecond, for example), you're more likely to run into problems.

There is also the manageability concern: how do you zero in on the data you want if source isn't a useful key? For many situations, host and sourcetype are sufficient, but for others you might need to rename, use wildcards, or extract parts of the path from the source as the main informative element.
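For instance, with date-named files you would typically monitor a wildcard path and pin the sourcetype, so searches key on sourcetype rather than the ever-changing source. A sketch of such an inputs.conf stanza, with hypothetical names:

    [monitor:///var/log/frog-decorator-*.log]
    sourcetype = frog_decorator
    disabled = false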

It's annoying, but you'd really rather solve those problems in Splunk, typically, than struggle with a logging app that crashes because it gets an error when it tries to open its logfile for creation when the file was already deleted but is still open for writing. (Yes, Windows remembers the NAME of files that are open but deleted, and blocks reuse of the NAME.)

Though I agree that, at least on UNIX, stable name + rotate is often the most convenient option for all parties.


yutaka1005
Builder

Hello inventsekar,

Thank you for the reply.

I understand that a normal rotate carries the risk of losing logs,
and that the delaycompress option is the better choice.
