Getting Data In

What happens when a log file gets renamed and later compressed?

rnr
Path Finder

Hi,

I've looked through similar questions about log rotation and also the most closely related documentation topic here
http://docs.splunk.com/Documentation/Splunk/6.0.2/Data/HowLogFileRotationIsHandled

but it still wasn't clear what happens when a log file gets renamed and later compressed. Let's look at the typical scenario with nginx logs.

access.log is being written, then renamed to access-20141020.log and nginx continues to write there. Splunk should recognize this situation by default without any additional settings or tweaking, because it's the same file, correct?

But what happens when nginx switches files, especially if the amount of data is quite large, say MB/s? Is Splunk going to finish indexing the old file, send the data to the forwarder, pick up the new access.log and continue from there? Will it pick up the new file immediately?

Compressing access-20141020.log should be straightforward to handle by blacklisting archive files in inputs.conf.

Thank you,
Roman Naumenko

1 Solution

jrodman
Splunk Employee

Typically, when a rotated logfile is compressed, Splunk will have already read the file in the uncompressed form, and will recognize the compressed form as data it has already dealt with.

There are edge cases. If the start or end of the compressed file does not fully agree with the uncompressed version, you might not get these neat and clean results, but the two usually do agree. There are also potential problems if the file is deleted while Splunk is reading it. (Compressing a rotated file means that the uncompressed version is deleted.) Usually on Unix with local files this works fine, but on Windows or over some kinds of network filesystems Splunk may get an error midway through the file and have to stop at that point. Not every scenario involving a partially read file that is then compressed is handled, and in some cases you may see partial duplication of data.

Best practice would be to keep at least one generation of the logfile, say logfile.1, in uncompressed form, to give Splunk plenty of time to find and acquire the data from the uncompressed form. If you then rotate to logfile.2.gz, Splunk should reliably recognize it as already handled data. If, however, you have extended downtime or other problems and unhandled data is sitting in logfile.2.gz by the time things are sorted out, Splunk will read and index the data from logfile.2.gz as well as logfile.1 and logfile. If you would rather err on the side of incomplete data than risk some duplicate data, you could take the approach suggested by musskopf and simply not monitor the compressed files at all.

rnr
Path Finder

Thank you, guys, for a comprehensive explanation.

musskopf
Builder

Hi Roman,

In my experience, Splunk basically watches two things for each monitored file:
- checksum for the first bytes of the file;
- and the size of the file.

Based on that, every time the file size changes or a new file appears, it checks the beginning of the file to decide whether the file is new or just the same file with more content. If it is the same file (same checksum, and a size bigger than last time), it goes to the last line indexed and continues from there. If Splunk identifies the file as new, it starts from line 1. The filename doesn't matter much; it just needs to match the naming filter you're monitoring.
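
To make the check described above concrete, here is a rough Python sketch. It is only an illustration of the idea, not Splunk's actual implementation: the 256-byte prefix length, the use of CRC32, and the names `FileState` and `classify` are all assumptions made for this example.

```python
import zlib
from dataclasses import dataclass

HEAD_BYTES = 256  # assumed length of the checksummed prefix (illustrative)


@dataclass
class FileState:
    head_crc: int  # checksum of the first HEAD_BYTES bytes
    offset: int    # number of bytes already read from this file


def head_crc(data: bytes) -> int:
    """Checksum of the beginning of the file only."""
    return zlib.crc32(data[:HEAD_BYTES])


def classify(seen: dict[int, FileState], data: bytes) -> tuple[str, int]:
    """Return (decision, offset to resume reading from).

    Files are tracked by the checksum of their first bytes, not by name,
    so a renamed file is recognized as the same file.
    """
    crc = head_crc(data)
    state = seen.get(crc)
    if state is None:
        # Unknown beginning: treat as a new file and read from the start.
        seen[crc] = FileState(crc, len(data))
        return ("new", 0)
    if len(data) > state.offset:
        # Same beginning, more content: read only the new part.
        start = state.offset
        state.offset = len(data)
        return ("grown", start)
    return ("unchanged", state.offset)


seen: dict[int, FileState] = {}
access_log = b"GET /index.html 200\n" * 20        # contents of access.log
print(classify(seen, access_log))                  # first time this file is seen
rotated = access_log + b"GET /about 200\n"         # renamed, then appended to
print(classify(seen, rotated))                     # same head, more data
print(classify(seen, rotated))                     # nothing changed since
```

Because the lookup key is the checksum of the head rather than the path, the rename from access.log to access-20141020.log makes no difference in this sketch; only the extra bytes are read.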

In your case, monitoring "access*.log" should do the trick. Make sure the compressed files are named access-xxxxxxx.log.gz so they are not included.

Also, if you have many days of logs in the same location, it is recommended to set ignoreOlderThan = 2d, for example, so Splunk won't keep monitoring the older files.

Also, you mentioned MB/s. Please note that the Splunk Universal Forwarder is limited by default to 256 KB/s of throughput (the maxKBps setting in limits.conf). You might need to increase that limit to keep up with the data volume you're generating.
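
As a back-of-the-envelope illustration of why the limit matters, the small Python sketch below estimates how much unread data piles up. The figures are assumptions for the example only: a steady write rate of 1 MB/s and a forwarding limit of 256 KB per second. The function name `backlog_kb` is hypothetical.

```python
def backlog_kb(seconds: float, write_rate_kb_s: float, limit_kb_s: float) -> float:
    """KB still waiting to be forwarded after `seconds` of sustained writing."""
    return max(0.0, (write_rate_kb_s - limit_kb_s) * seconds)


# Assumed figures: nginx writes 1024 KB/s, the forwarder sends at most 256 KB/s.
one_hour = backlog_kb(3600, 1024, 256)
print(f"{one_hour:.0f} KB behind after one hour ({one_hour / 1024 / 1024:.2f} GB)")
```

With those assumed rates the forwarder falls 768 KB further behind every second, so a rotation or compression step can easily reach a file before it has been fully read.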

Hope it helps!
Cheers

rnr
Path Finder

Ok, thanks for the suggestion.

How is the old file handled? Does Splunk recognize that it's just a renamed file and that indexing should continue?

musskopf
Builder

I don't think Splunk cares about the file name at all. If a file is renamed and the checksum is still the same, it won't re-index the file. If it was renamed and then received more content, Splunk should grab only the deltas.
