We plan to use Splunk to keep logs for several Java applications, including web servers like Tomcat. These applications use log4j with org.apache.log4j.RollingFileAppender. The partial config looks like this:
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=20
That is, when server.log reaches 10MB it is renamed (rolled over) to server.log.1, server.log.1 becomes server.log.2, and so forth.
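For context, a complete appender definition along these lines might look like the following sketch. The appender name R and the MaxFileSize/MaxBackupIndex values come from the question; the file path, root logger level, and layout pattern are assumptions for illustration only.

```properties
# Illustrative log4j.properties sketch -- file path and layout are assumed
log4j.rootLogger=INFO, R
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=/var/log/myapp/server.log
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=20
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c - %m%n
```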
My question: how should Splunk monitor these rolling log files without losing data?
Monitoring server.log alone can work well, but there's an unavoidable race in which the tail end of the file can be missed at roll time. For some users this never seems to occur, or they don't mind missing a few lines. I generally recommend monitoring server.log.1 as well as server.log.
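That recommendation might look like the following inputs.conf sketch; the file paths and sourcetype are assumptions, not taken from the question.

```ini
# local/inputs.conf -- monitor both the live file and the first backup
# (paths and sourcetype are illustrative)
[monitor:///var/log/myapp/server.log]
sourcetype = log4j

[monitor:///var/log/myapp/server.log.1]
sourcetype = log4j
```

Splunk's checksum-based file tracking should recognize server.log.1 as the already-read server.log after a roll, so this generally doesn't double-index; it mainly catches lines written just before the roll.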
This issue is generic to all rolling logfiles.
Splunk 4.0 and earlier wait for a file to go 5 seconds stale before closing and re-opening it (which is how the roll gets handled). If your file rolls multiple times within that 5-second window, some files will be missed entirely.
You can tune the time_before_close value in local/limits.conf, but there can be a performance penalty, since our setup and teardown of file input streams isn't our best-optimized behavior.
If you have a relatively fixed number of file inputs, and changing the logging behavior is undesirable, it might be best to raise max_fd in limits.conf to a value larger than your input count (say 250 for 200 files), and then set dedicatedFd on for the inputs pointing at those specific files. This means Splunk will more or less always try to keep those files open. At that point you can drop time_before_close to a value like 1, and hopefully this will catch every roll.
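Put together, that tuning might look like the sketch below. The values are illustrative for roughly 200 monitored files, and the placement of dedicatedFd in a monitor stanza is an assumption based on the advice above.

```ini
# local/limits.conf -- illustrative values for ~200 monitored files
[inputproc]
max_fd = 250
time_before_close = 1

# local/inputs.conf -- dedicatedFd placement is assumed; path is illustrative
[monitor:///var/log/myapp/server.log]
dedicatedFd = true
```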
Realistically, you probably want your files to roll less often than this. Having your data expire from disk in 400 seconds means you'll likely lose data during spikes, or during brief splunkd downtime such as upgrades. Or is the total data rate simply so high that you can't keep data longer than this?
Note that 4.1 rewrites the file acquisition code, so the worst-case time to acquire active files shrinks drastically, but even 2 seconds may be tight for a fairly busy forwarder with many data sources.