Getting Data In

"Hit EOF while computing CRC" after logrotate runs

Explorer

Since upgrading to Spunk 4.2.1 last month, I'm having trouble with logrotate causing our light forwarders to stop monitoring our production logfiles. I've added followTail = 1 and crcSalt = to the inputs.conf file to try to keep Splunk monitoring the file even after it's been truncated to 0 bytes by logrotate, but I'm still seeing this error sporadically:

05-12-2011 04:14:38.564 -0700 WARN  FileInputTracker - Hit EOF while computing CRC: totalread=0/thisread=0/shouldread=256/hashbytes=256/fdpos=0
05-12-2011 04:14:39.186 -0700 ERROR TailingProcessor - Ignoring path due to: Could not checksum file='/var/log/mongrel/production.log'.

Here is the relevant section of inputs.conf for this light forwarder:

[monitor:///var/log/mongrel]
blacklist = \.gz$|csv_log|apn
disabled = false
followTail = 1
crcSalt = <SOURCE>
host = app5
sourcetype = Rails

And here is /etc/logrotate.d/mongrel, which controls the rotation of this log:

/var/log/mongrel/*.log {
  daily
  missingok
  dateext
  rotate 7
  size 500M
  compress
  notifempty
  sharedscripts
  extension gz
  copytruncate
}

Here's the section of inputs.conf that worked for us under Splunk 4.1:

[monitor:///var/log/mongrel]
blacklist = \.gz$|csv_log|apn
disabled = false
followTail = 0
host = app5
sourcetype = Rails

Restarting splunk causes the file to be monitored again, but I'd like the forwarders to be able to survive the daily log rotation (as they used to with Splunk 4.1.) Any ideas?

1 Solution

Splunk Employee
Splunk Employee

It seems mongrel simply writes to stderr as its logging strategy with stderr already attached to a specific file. This means that copy and truncate is the only option. Which is too bad, since it's a fundamental race condition to actually keep all your data and roll.

stderr is pretty reasonable for low volume stuff that doesn't need to be rolled, but it's a poor choice for rolling.

However the real world isn't so easy, and we can't dictate how apps work. We need to look into that message, and I'll try to. I think you should have a support case open if you don't already.

View solution in original post

Splunk Employee
Splunk Employee

It seems mongrel simply writes to stderr as its logging strategy with stderr already attached to a specific file. This means that copy and truncate is the only option. Which is too bad, since it's a fundamental race condition to actually keep all your data and roll.

stderr is pretty reasonable for low volume stuff that doesn't need to be rolled, but it's a poor choice for rolling.

However the real world isn't so easy, and we can't dictate how apps work. We need to look into that message, and I'll try to. I think you should have a support case open if you don't already.

View solution in original post

Explorer

Thanks. Still wonder why I wasn't seeing this with the older version, though it's possible this was happening and I just didn't notice. I'll open a support ticket regarding the message.

0 Karma

SplunkTrust
SplunkTrust

In your logrotate config, sharedscripts implies create (so says man logrotate on my ubuntu natty). Depending on what your scripts do, this could be at odds with copytruncate.

There are chances for a few different race conditions here. With copytruncate, while logrotate is making a copy of your logfile, Splunk could see the copy (before it gets gzipped) and try to index it. With crcSalt = <SOURCE> that could lead to some duplicate events. (Depending on the timing) Removing copytruncate reduces the chances in that. But, there could be a different timing condition where when you rename the existing log and gzip it Splunk could miss some events near EOF.

With two shell scripts competing to add data to a log file and to rotate it as quickly as possible, I was able to make a handful of these race conditions happen. Granted it is a bit of an unfair test, because you don't try to run logrotate multiple times each minute.

This is a little harder to do with logrotate, but procedure-wise the very cleanest approach is something like this:

  1. Rename xxx.log to xxx.log.
  2. Create a new xxx.log as a 0-byte file
  3. SIGHUP (or whichever signal it needs) mongrel to have it swap files
  4. Give Splunk "some time" (say 5 minutes) to "finish up" the log file named xxx.log.. It will notice the CRC and pick up where it left off.
  5. gzip xxx.log.

This approach gives everyone what they are looking for, and minimizes the chances for race conditions and missed or duplicate events. It also doesn't need crcSalt = <SOURCE>, followTail or alwaysOpenFile.

0 Karma

Splunk Employee
Splunk Employee

The file-in-creation situation could cause splunk to see a shorter version of the file, which we have to assume is new.

The the crcSalt interaction will force us to reindex every time the file is reanamed (rolled), so it's just unwanted.

0 Karma

Splunk Employee
Splunk Employee

Also we should look at what you upgraded from, and how it was configured. 4.1 should behave the same way. 4.0 is quite different.

0 Karma

Explorer

Looking at this now, seems like copytruncate could be the culprit. Can anyone confirm or deny that?

0 Karma

Splunk Employee
Splunk Employee

It's possible there's a truncate scenario we don't yet handle cleanly. I anti-recommend both crcSalt and followTail for the generic case. I second the logrotate.conf information request. Alternatively a story of the seqeunce of file operations would help.

0 Karma

SplunkTrust
SplunkTrust

Can you update with the relevant section of logrotate.conf?

0 Karma

Splunk Employee
Splunk Employee

The alwaysopenfile parameter is expensive, but would likely help in this scenario.

alwaysOpenFile = [0|1]

http://www.splunk.com/base/Documentation/latest/Admin/Inputsconf

I don't believe the crcsalt will help as it is stating the same file, but I could be wrong.

0 Karma
State of Splunk Careers

Access the Splunk Careers Report to see real data that shows how Splunk mastery increases your value and job satisfaction.

Find out what your skills are worth!