Whitelist for standard /var/log

cvajs
Contributor

Splunk v4.3.1 on SLES 11.1

The standard whitelist for the /var/log data source will produce duplicate indexing, because by default SLES rotates the messages file out to another file, "messages-YYYYMMDD", and bzips that file on the second rotation (i.e. delaycompress in logrotate).

default whitelist
(.log|log$|messages|secure|auth|mesg$|cron$|acpid$|.out)

default blacklist
(lastlog)

So in my case, I think changing the whitelist to use ^messages$ would be better, and possibly tightening some of the others, such as .log and log$.
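To see how loosely the default whitelist matches, here is a rough sketch in Python. It assumes the whitelist is applied as an unanchored regex search against the full file path, and it uses Python's re module as a stand-in for Splunk's regex engine. The file names and the STRICTER pattern are hypothetical, chosen only for illustration (the pattern anchors on the path separator rather than ^, since the match is assumed to run against the whole path).

```python
import re

# Default whitelist quoted in the post above.
DEFAULT_WHITELIST = re.compile(r"(.log|log$|messages|secure|auth|mesg$|cron$|acpid$|.out)")

# Hypothetical tighter pattern (not from the thread): only the live messages
# file and names that really end in ".log".
STRICTER = re.compile(r"(/messages$|\.log$)")

# Hypothetical sample paths for a SLES box with delayed compression.
paths = [
    "/var/log/messages",
    "/var/log/messages-20120101",
    "/var/log/messages-20120101.bz2",
    "/var/log/zypper.log",
    "/var/log/wtmp",
]

def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere in the path."""
    return [p for p in candidates if pattern.search(p)]

if __name__ == "__main__":
    print("default :", matching(DEFAULT_WHITELIST, paths))
    print("stricter:", matching(STRICTER, paths))
```

Under these assumptions the unescaped ".log" alternative matches the "/log" in "/var/log" itself, so every sample path passes the default whitelist, including both rotated copies of messages.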

jbsplunk
Splunk Employee

Splunk should not produce duplicate results because of file rotation. In the situation you reference, the CRC will match and the rotated file will be ignored. You shouldn't need to edit the whitelist in the situation you've mentioned.

Details can be found here:

http://docs.splunk.com/Documentation/Splunk/latest/Data/Howlogfilerotationishandled

The monitoring processor picks up new files and reads the first and last 256 bytes of the file. This data is hashed into a begin and end cyclic redundancy check (CRC). Splunk checks new CRCs against a database that contains all the CRCs of files Splunk has seen before. The location Splunk last read in the file, known as the file's seekPtr, is also stored.

There are three possible outcomes of a CRC check:

1. There is no begin and end CRC matching this file in the database. This indicates a new file. Splunk will pick it up and consume its data from the start of the file. Splunk updates the database with the new CRCs and seekPtrs as the file is being consumed.

2. The begin CRC and the end CRC are both present, but the size of the file is larger than the seekPtr Splunk stored. This means that, while Splunk has seen the file before, there has been data added to it since it was last read. Splunk opens the file, seeks to the previous end of the file, and starts reading from there. In this way, Splunk will only grab the new data and not anything it has read before.

3. The begin CRC is present, but the end CRC does not match. This means that Splunk has previously read the file but that some of the material that it read has since changed. In this case, Splunk must re-read the whole file.

Important: Since the CRC start check is run against only the first 256 bytes of the file, it is possible for non-duplicate files to have duplicate start CRCs, particularly if the files are ones with identical headers. To handle such situations, you can use the crcSalt attribute when configuring the file in inputs.conf, as described here. The crcSalt attribute ensures that each file has a unique CRC. You do not want to use this attribute with rolling log files, however, because it defeats Splunk's ability to recognize rolling logs and will cause Splunk to re-index the data.
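The outcomes described above can be sketched roughly as follows. This is not Splunk's implementation: the database is a plain dict, the names classify and remember are made up for the example, and the end CRC is assumed to cover the 256 bytes ending at the stored seekPtr (the quoted text only says "last 256 bytes"). A fourth result, "unchanged", is added for the rotated-copy case where nothing new was written.

```python
import zlib

WINDOW = 256  # bytes hashed at each end, per the description above


def crc(data: bytes) -> int:
    return zlib.crc32(data)


def remember(data: bytes, db: dict) -> None:
    """Record begin CRC -> (seekPtr, end CRC) after reading the whole file."""
    seek_ptr = len(data)
    end_crc = crc(data[max(0, seek_ptr - WINDOW):seek_ptr])
    db[crc(data[:WINDOW])] = (seek_ptr, end_crc)


def classify(data: bytes, db: dict) -> str:
    """Decide how a file's contents relate to what has been seen before."""
    record = db.get(crc(data[:WINDOW]))
    if record is None:
        return "new"        # outcome 1: read from the start
    seek_ptr, end_crc = record
    if len(data) < seek_ptr:
        return "reread"     # file shrank, previously read data is gone
    if crc(data[max(0, seek_ptr - WINDOW):seek_ptr]) != end_crc:
        return "reread"     # outcome 3: previously read data changed
    if len(data) > seek_ptr:
        return "append"     # outcome 2: read from seekPtr onward
    return "unchanged"      # e.g. a renamed, rotated copy


if __name__ == "__main__":
    db = {}
    original = b"".join(b"line %d\n" % i for i in range(200))
    print(classify(original, db))                 # new
    remember(original, db)
    print(classify(original, db))                 # unchanged
    print(classify(original + b"line 200\n", db)) # append
    print(classify(original[:-1] + b"X", db))     # reread
```

Because the lookup key is only the CRC of the first 256 bytes, two different files with identical headers would collide in this sketch, which is the situation the crcSalt note is about.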

gkanapathy
Splunk Employee

I think it would be better to blacklist bzip and gzip files (\.bz$|\.gz$|\.gzip$, or similar). If you only whitelist the first file, it's possible to miss a message that was sent to the file handle after the file was rotated/renamed the first time. You could whitelist the date-formatted file as well, but because the rotation names might be more varied, I think it's easier to blacklist the delay-compressed files.

cvajs
Contributor

I thought gzip files got a CRC check but bzip files did not? Is bz2 in the list of files for bzip? I typically blacklist \.(gz|gzip|bz|bz2|z|zip)$ to avoid even a CRC check.

blacklist for /var/log for my sles 11.1
(lastlog|\.(gz|gzip|bz|bz2|z|zip))$
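A quick way to compare the two blacklists from this thread is to run both against a few file names. The sketch below assumes the blacklist is an unanchored regex search on the full path and uses Python's re module in place of Splunk's regex engine. The sample paths are hypothetical.

```python
import re

# Blacklist suggested earlier in the thread ("or similar").
SUGGESTED = re.compile(r"\.bz$|\.gz$|\.gzip$")

# Blacklist from this post.
EXTENDED = re.compile(r"(lastlog|\.(gz|gzip|bz|bz2|z|zip))$")

# Hypothetical sample paths.
paths = [
    "/var/log/messages",
    "/var/log/messages-20120101",
    "/var/log/messages-20120101.bz2",
    "/var/log/mail.gz",
    "/var/log/lastlog",
]

def blacklisted(pattern, candidates):
    """Return the candidates the pattern would exclude."""
    return [p for p in candidates if pattern.search(p)]

if __name__ == "__main__":
    print("suggested:", blacklisted(SUGGESTED, paths))
    print("extended :", blacklisted(EXTENDED, paths))
```

With these inputs, \.bz$ does not catch the .bz2 file, while the extended pattern excludes the .bz2, the .gz and lastlog and leaves both uncompressed messages files to the CRC check.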

Also, if the file was written to after the rename, the CRC check would still match, since it looks at only the first 256 bytes. In that case it seems like events can be missed, since Splunk will not see the renamed file as new.

cvajs
Contributor

All good info. The CRC check will indeed skip the non-bzipped rotated file, but Splunk will re-index the bzipped ones. So I think my question is solved.

gkanapathy
Splunk Employee

Actually, because the new file has been bzipped, Splunk will in fact re-index it. A better solution really is to blacklist the gzip and bzip file formats.
