Getting Data In

How do I set up Splunk to index log4j with org.apache.log4j.RollingFileAppender

Path Finder

We plan to use Splunk to keep logs for several Java applications, including web servers like Tomcat. These applications use log4j with org.apache.log4j.RollingFileAppender. A partial config looks like this:

log4j.appender.R.MaxFileSize=10MB 
log4j.appender.R.MaxBackupIndex=20

That is, when server.log reaches 10MB, it is rolled over to server.log.1, server.log.1 becomes server.log.2, and so forth.
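For reference, these rollover settings usually sit in a full appender definition along these lines. Only the MaxFileSize and MaxBackupIndex lines come from the snippet above; the appender name R is reused from it, and the file path and layout pattern are illustrative assumptions:

```properties
# Hypothetical full RollingFileAppender config; everything except
# MaxFileSize and MaxBackupIndex is an assumed example value.
log4j.rootLogger=INFO, R
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=/var/log/myapp/server.log
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=20
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```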

My questions:

  1. Is it correct to set up Splunk to monitor (set the file as an input) server.log only, with no need to monitor server.log.n?
  2. How quickly can Splunk react to a change in server.log? For example, if my app server writes a lot of log data (e.g., 10MB written in less than 2 seconds), can Splunk still read and capture my log before server.log rolls over to server.log.1? If not, how should I set up Splunk?
1 Solution

Splunk Employee

Monitoring server.log only can work well, but there's an unavoidable race in which we can miss the end of the file. For some users this doesn't tend to occur, or they don't mind missing a few lines. I generally recommend monitoring server.log as well as server.log.1.
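A minimal inputs.conf sketch for monitoring both files, assuming the logs live in /var/log/myapp and a log4j sourcetype (both are placeholders to adjust for your deployment):

```ini
# $SPLUNK_HOME/etc/system/local/inputs.conf
# Paths and sourcetype are assumptions.
[monitor:///var/log/myapp/server.log]
sourcetype = log4j

# Also watch the first rolled file, so lines missed at roll time
# are still picked up from server.log.1.
[monitor:///var/log/myapp/server.log.1]
sourcetype = log4j
```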

This issue is generic to all rolling logfiles.


Splunk 4.0 and earlier wait for the file to become 5 seconds stale before closing and re-opening it (which is how the roll gets handled). If your file rolls multiple times in that 5-second window, some files will be missed entirely.

You can tune the time_before_close value in local/limits.conf, but there can be a performance penalty, as our setup and teardown of file input streams is not especially well optimized.

If you have a relatively fixed number of file inputs, and changing the logging behavior is undesirable, it might be best to raise max_fd in limits.conf to a value larger than your input count (say 250 for 200 files), and then enable dedicatedFd for the inputs pointing at those specific files. This means Splunk will more or less always try to keep those files open. At that point you can drop time_before_close to a value like 1, and hopefully this will catch every roll.
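Putting that together, a sketch of the dedicated-FD setup, with values taken from the 250-for-200 example above and an assumed file path:

```ini
# $SPLUNK_HOME/etc/system/local/limits.conf
[inputproc]
max_fd = 250              # larger than the number of monitored files
time_before_close = 1     # close idle files quickly so rolls are caught

# $SPLUNK_HOME/etc/system/local/inputs.conf
[monitor:///var/log/myapp/server.log]
dedicatedFd = true        # keep this file's descriptor open
```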

Realistically, you probably want your files to roll less often than this. Having your data expire in 400 seconds means you'll likely lose data during spikes, or during brief splunkd downtimes such as upgrades. Perhaps the total data rate is simply so high that you can't keep data longer than this?

Note that 4.1 rewrites the file-acquisition code so that the worst-case time to acquire active files shrinks drastically, but 2 seconds may still be stretching it for a fairly busy forwarder with many data sources.
