Since upgrading to Splunk 4.2.1 last month, I'm having trouble with logrotate causing our light forwarders to stop monitoring our production logfiles. I've added followTail = 1 and crcSalt = <SOURCE>, but I'm still seeing errors like these:
05-12-2011 04:14:38.564 -0700 WARN FileInputTracker - Hit EOF while computing CRC: totalread=0/thisread=0/shouldread=256/hashbytes=256/fdpos=0
05-12-2011 04:14:39.186 -0700 ERROR TailingProcessor - Ignoring path due to: Could not checksum file='/var/log/mongrel/production.log'.
Here is the relevant section of inputs.conf for this light forwarder:
[monitor:///var/log/mongrel]
blacklist = \.gz$|csv_log|apn
disabled = false
followTail = 1
crcSalt = <SOURCE>
host = app5
sourcetype = Rails
And here is /etc/logrotate.d/mongrel, which controls the rotation of this log:
/var/log/mongrel/*.log {
daily
missingok
dateext
rotate 7
size 500M
compress
notifempty
sharedscripts
extension gz
copytruncate
}
Here's the section of inputs.conf that worked for us under Splunk 4.1:
[monitor:///var/log/mongrel]
blacklist = \.gz$|csv_log|apn
disabled = false
followTail = 0
host = app5
sourcetype = Rails
Restarting Splunk causes the file to be monitored again, but I'd like the forwarders to survive the daily log rotation (as they did with Splunk 4.1). Any ideas?
It seems mongrel simply writes to stderr as its logging strategy, with stderr already attached to a specific file. That means copy-and-truncate is the only rotation option, which is unfortunate, since copy-and-truncate has a fundamental race condition: you can't guarantee you keep all your data while rolling the file.
stderr is reasonable for low-volume output that doesn't need to be rolled, but it's a poor choice when rolling is required.
However the real world isn't so easy, and we can't dictate how apps work. We need to look into that message, and I'll try to. I think you should have a support case open if you don't already.
Thanks. Still wonder why I wasn't seeing this with the older version, though it's possible this was happening and I just didn't notice. I'll open a support ticket regarding the message.
In your logrotate config, sharedscripts implies create (so says man logrotate on my Ubuntu Natty). Depending on what your scripts do, this could be at odds with copytruncate.
There are chances for a few different race conditions here. With copytruncate, while logrotate is making a copy of your logfile, Splunk could see the copy (before it gets gzipped) and try to index it. With crcSalt = <SOURCE>, that could lead to some duplicate events, depending on the timing. Removing copytruncate reduces the chance of that, but there's a different timing window: when the existing log is renamed and gzipped, Splunk could miss some events near EOF.
With two shell scripts competing to add data to a log file and to rotate it as quickly as possible, I was able to make a handful of these race conditions happen. Granted, it's a bit of an unfair test, since you wouldn't normally run logrotate multiple times a minute.
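To make the copytruncate hazard concrete, here is a small deterministic sketch (not the actual test harness described above) of the data-loss window: any line the app writes between logrotate's copy and its truncate is silently lost. Paths are illustrative.

```shell
#!/bin/sh
# Deterministic illustration of the copytruncate data-loss window.
LOG=/tmp/copytruncate_demo.log
: > "$LOG"
echo "event 1" >> "$LOG"
cp "$LOG" "$LOG.1"          # logrotate copies the file...
echo "event 2" >> "$LOG"    # ...the app writes in the copy/truncate window...
: > "$LOG"                  # ...then logrotate truncates: "event 2" is gone
cat "$LOG" "$LOG.1"         # prints only "event 1"
```

In a real rotation the window is short, but under heavy write volume something always lands in it eventually.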
This is a little harder to do with logrotate, but procedure-wise the very cleanest approach is something like this:
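As a hedged sketch of such a procedure: rename the log, signal the app to reopen it, and defer compression by one cycle so the indexer can finish reading the renamed file. In logrotate terms it might look like this (assuming mongrel can be made to reopen its log, e.g. via a restart in postrotate; the init script path is an assumption, adjust for your setup):

```
/var/log/mongrel/*.log {
    daily
    rotate 7
    missingok
    notifempty
    compress
    delaycompress   # keep the newest rotated file uncompressed for one cycle
    sharedscripts
    postrotate
        # Assumption: this restart makes mongrel reopen its stderr log.
        /etc/init.d/mongrel restart > /dev/null 2>&1 || true
    endscript
}
```

The key points are dropping copytruncate (no truncation window) and using delaycompress (Splunk can finish reading the renamed file before it's gzipped).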
This approach gives everyone what they're looking for, and minimizes the chances of race conditions and missed or duplicate events. It also doesn't need crcSalt = <SOURCE>, followTail, or alwaysOpenFile.
The file-in-creation situation could cause Splunk to see a shorter version of the file, which we have to assume is new.
The crcSalt interaction will force us to reindex every time the file is renamed (rolled), so it's just unwanted.
Also we should look at what you upgraded from, and how it was configured. 4.1 should behave the same way. 4.0 is quite different.
Looking at this now, seems like copytruncate could be the culprit. Can anyone confirm or deny that?
It's possible there's a truncate scenario we don't yet handle cleanly. I anti-recommend both crcSalt and followTail for the generic case. I second the request for the logrotate.conf information; alternatively, a description of the sequence of file operations would help.
Can you update with the relevant section of logrotate.conf?
The alwaysOpenFile parameter is expensive, but would likely help in this scenario.
alwaysOpenFile = [0|1]
http://www.splunk.com/base/Documentation/latest/Admin/Inputsconf
I don't believe crcSalt will help, as it's still the same file, but I could be wrong.
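For illustration, here is the question's monitor stanza with alwaysOpenFile enabled (a sketch only; whether the extra open/close cost is acceptable depends on how many files the monitor matches):

```
[monitor:///var/log/mongrel]
blacklist = \.gz$|csv_log|apn
disabled = false
alwaysOpenFile = 1
host = app5
sourcetype = Rails
```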