Getting Data In

Why is a .gz file created by log rotation indexed again by Splunk?

Communicator

I have a log file called testlogs.log. Once it hits a specific size, it rotates to create testlogs.log.1.gz. I monitor the .log file and all the zipped .gz files. I noticed that testlogs.log.1.gz (which has the same data that was in testlogs.log) is also indexed by Splunk.

Won't the .log file and the .log.1.gz file have the same crcSalt, since they contain the same data?

Edit: To be clear, I'm not writing anything to the .gz files. Logs are written only to the .log files.
Thanks

I can confirm Splunk is not handling compressed, rotated log files correctly.

I copied the current log to a file with a .2s suffix:

cp ptpe-service.log ptpe-service.2s.log

Splunk correctly matched it with the current log and updated the seek address accordingly:

06-27-2016 16:54:45.871 -0700 DEBUG WatchedFile - seeking /base/logs/web/ptpe/ptpe-service.2s.log to off=192683

But when I compressed the same file (gzip ptpe-service.2s.log), Splunk couldn't match it with the current log:

INFO ArchiveProcessor - reading path=/base/logs/web/ptpe/ptpe-service.2s.log.gz (seek=0 len=16227)

It looks like Splunk doesn't handle compressed files well.

I also tried copying a .gz into another .gz; that copy wasn't reindexed, though.
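One way to see why a plain byte-level comparison cannot match a .gz against its .log: the first bytes of a gzip file are the gzip header and compressed stream, not the log text. The sketch below is only an illustration in Python; it makes no claim about what Splunk does internally, and `heads_match` and the sample log lines are made up for the example.

```python
import gzip

HEAD_BYTES = 256  # size of the initial window discussed in this thread


def heads_match(a: bytes, b: bytes) -> bool:
    """Compare only the first HEAD_BYTES bytes of two byte strings."""
    return a[:HEAD_BYTES] == b[:HEAD_BYTES]


# Made-up log content, long enough to exceed the 256-byte window.
log_data = b"".join(
    b"2016-06-27 16:54:45 INFO request %d handled\n" % i for i in range(50)
)
gz_data = gzip.compress(log_data)

# Raw bytes of the .gz versus the .log: the heads differ.
raw_match = heads_match(log_data, gz_data)

# Decompressed content versus the .log: the heads are identical.
decompressed_match = heads_match(log_data, gzip.decompress(gz_data))

print(raw_match, decompressed_match)
```

So any tool that wants to recognize a .gz as "the same file" has to decompress it first, which is where timing during rotation starts to matter.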

0 Karma

SplunkTrust

Everyone else's answers are great, but to add just a little more info:

First of all, crcSalt really only supports one option, and that is crcSalt = <SOURCE>. If you set crcSalt = <SOURCE>, you should expect files with the same content but different names to get reindexed over and over. This is usually NOT what you want, so using crcSalt at all must come with a grain of salt... sorry for the bad joke.

Second, race conditions. Splunk is usually smart enough to look inside compressed files, see whether they are equivalent to a file it has already processed in uncompressed form, and skip them. But depending on how your log compressor and rotator work, there can be a race condition in which Splunk sees the new xxx.log.gz while it is still being built. While the file is still being built, Splunk computes the initial CRC of the first 256 bytes of the file and checks it against the fishbucket. When the initial CRC matches, but the 'new' file is smaller than the file Splunk first recorded with that CRC, Splunk decides to process the new file, creating duplicate events.
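As a toy model of the decision described above (not Splunk's actual implementation), the check can be sketched in Python. Everything here is hypothetical: the `SeenFile` record, the `decide` function, the use of `zlib.crc32` as the checksum, and the in-memory dict standing in for the fishbucket. `data` stands for the uncompressed content as the monitor would see it at that moment.

```python
import zlib
from dataclasses import dataclass

HEAD_BYTES = 256  # initial window mentioned in the answer


@dataclass
class SeenFile:
    head_crc: int    # checksum of the first HEAD_BYTES bytes
    bytes_read: int  # how far the file had been read


def head_crc(data: bytes) -> int:
    """Checksum of the initial window (zlib.crc32 is a stand-in)."""
    return zlib.crc32(data[:HEAD_BYTES])


def decide(data: bytes, seen: dict[int, SeenFile]) -> str:
    """Return 'index', 'resume' or 'reindex' for the content seen right now."""
    record = seen.get(head_crc(data))
    if record is None:
        return "index"    # never seen this head before
    if len(data) < record.bytes_read:
        return "reindex"  # same head but shorter: treated as a new file
    return "resume"       # same head, at least as long: continue from the old offset


full = b"".join(b"event %d\n" % i for i in range(200))
seen = {head_crc(full): SeenFile(head_crc(full), len(full))}

half_written = full[:600]  # what a half-built archive might decompress to
print(decide(half_written, seen), decide(full, seen))
```

In this model the half-written archive lands in the 'reindex' branch, which is the duplicate-event case from the paragraph above.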

You should make sure that your compressor / rotator tools always write to a temporary file and rename that file to .gz only when the compression is done. This way, Splunk and the compressor never get into a race condition where the state of the compressed file is in flux.
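A minimal sketch of that compress-then-rename pattern in Python, assuming you control the rotation script. The function name `compress_atomically` and the `.tmp` suffix are made up for the example; the temporary name must not be picked up by the monitor (for instance, blacklist it or write it to an unmonitored directory on the same filesystem).

```python
import gzip
import os
import shutil


def compress_atomically(src_path: str) -> str:
    """Compress src_path to src_path + '.gz' without exposing a partial .gz."""
    final_path = src_path + ".gz"
    tmp_path = final_path + ".tmp"  # hypothetical suffix, keep it out of the monitor

    # Write the complete archive under the temporary name first.
    with open(src_path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # os.replace is an atomic rename when both paths are on the same filesystem,
    # so the .gz name only ever refers to a finished archive.
    os.replace(tmp_path, final_path)
    os.remove(src_path)
    return final_path
```

Calling `compress_atomically("testlog.log.1")` would leave a finished `testlog.log.1.gz` and no intermediate state under the final name.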

Also, as suggested above, blacklisting compressed files is a great way to avoid this entirely.

Note - the forum editor was eating the angle brackets. The setting meant above is crcSalt = <SOURCE>, that is, crcSalt = (less_than)SOURCE(greater_than).

SplunkTrust

test to see if crcSalt = <SOURCE> is supported in comments

0 Karma

Communicator

That was interesting. I looked into the way my log rotation works.
1) testlog.log is created when logging starts.
2) Once it hits the max size, it is rotated to testlog.log.1.
3) testlog.log.1 is now compressed to testlog.log.1.gz.
4) The testlog.log.1.gz that already existed is renamed to testlog.log.2.gz.

I see the possibility of Splunk looking at the .gz file while it is being created, and hence re-indexing it. As far as I understand, that should only be a problem for testlog.log and testlog.log.1.gz, as that is the only place where compression happens. The other files are just renamed.
To test this out, I blacklisted the .1.gz file. This time I see testlog.log and testlog.log.2.gz indexed, causing duplicates. testlog.log.3.gz and testlog.log.4.gz are not indexed.

0 Karma

Influencer

Since your .gz files are just rotated logs, and Splunk is actively indexing your .log files, there is really no advantage to having Splunk index the logs again once they've rotated into .gz format. In fact, doing so is likely counting against your license. I would recommend that you blacklist the .gz files and whitelist the .log files in your inputs.conf. Here's an example of how to do so with your stanza:

[monitor:///Users/myusername/Desktop/splunk_monitor_logs]
whitelist = \.log$
blacklist = \.gz$
disabled = false
followTail = 0
index = my_index
sourcetype = test_source
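whitelist and blacklist are regular expressions, so an unescaped dot matches any character. The Python sketch below shows the difference between `.gz$` and `\.gz$`; it uses Python's `re` module as a stand-in for Splunk's regex engine, and the helper `is_monitored`, the sample paths, and the "blacklist wins" ordering are assumptions made for the illustration.

```python
import re


def is_monitored(path: str, whitelist: str, blacklist: str) -> bool:
    """Assumed semantics: a blacklist match excludes, otherwise the whitelist must match."""
    if re.search(blacklist, path):
        return False
    return re.search(whitelist, path) is not None


# Hypothetical file names to exercise the patterns.
paths = [
    "/logs/test_log.log",       # live log
    "/logs/test_log.log.1.gz",  # rotated archive
    "/logs/catalog",            # ends in 'alog', not '.log'
]

for path in paths:
    loose = is_monitored(path, r".log$", r".gz$")
    strict = is_monitored(path, r"\.log$", r"\.gz$")
    print(path, loose, strict)
```

With the loose patterns, a file such as `/logs/catalog` is whitelisted by accident because `.` matches the `a` before `log`; the escaped patterns only match a literal dot.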

Communicator

Thanks for the response. I have explained the reason why I'm indexing the gz logs here http://answers.splunk.com/comments/223267/view.html.

Based on my understanding (please correct me if I'm wrong), before indexing a file Splunk looks at the first 256 bytes to see whether it has already been indexed. In my use case, most of the time it won't have to index the .gz files, as they will already be indexed. I'm doing this just to make sure that I don't lose any logs.

0 Karma

SplunkTrust

I believe you're using crcSalt for the wrong purpose here. See the full description of the crcSalt attribute here:
http://docs.splunk.com/Documentation/Splunk/6.1.2/admin/Inputsconf

I guess you wanted to provide the file name here so that all other files will get filtered out. For that just try this (no crcSalt)

[monitor:///Users/myusername/Desktop/splunk_monitor_logs/test_log.log]
 disabled = false
 followTail = 0
 index = my_index
 sourcetype = test_source

Communicator

You are right, I asked the wrong question. Even if I don't add crcSalt to this stanza, Splunk shouldn't index the .gz files, as I would assume the first 256 bytes of the .log and .log.1.gz are the same. There are a couple of things I was not clear about in the question.
1) I have to monitor the zipped (gz) logs as well. This is because I have huge chunks of data getting logged in testlogs.log, and at times testlogs.log gets rotated even before those events are indexed. Looking into all the zipped files helps prevent missing logs to a large extent.
2) Why did I include the crcSalt here?
I have another log file called newlog.log (under another monitor) which has the same first 256 bytes as testlog.log. I added crcSalt = newlog for that monitor and crcSalt = test_log in the above monitor, as I don't want to miss either of those logs; Splunk can skip indexing a file because of the identical first 256 bytes.

Even after removing the crcSalt I still see the same issue.

0 Karma

Motivator

To summarize, due to a high logging rate, sounds like logs are rotating before Splunk gets to EOF?
You may want to check out this post for more info: http://answers.splunk.com/answers/58549/high-volume-log-rotation.html

0 Karma

Communicator

I came across that post when I was looking for a solution for missing logs due to high rate of logging.
The solution discussed there was to use the time_before_close attribute. My logger doesn't produce logs continuously. It logs a large chunk of data into the log files every 10 minutes. So I would have to set time_before_close to a value slightly more than 10 minutes to get good results. Also, there are other use cases where I may end up losing logs even after setting this attribute. That was the reason why I decided to index the zipped files as well.

0 Karma

Motivator

Are you able to control the rotation of your log files so that they are not rotated as quickly? i.e. allow them to grow to 1GB, etc. Seems it would be preferable to have Splunk caught up to the EOF before the rotation.

0 Karma

Communicator

I was planning to increase the log file size if I cannot find a solution to this problem, even though it is not my preferred solution. As mentioned in a previous comment, I don't have a continuous flow of data into the log file. Logging happens only once every 5 minutes. So I decided to set the time_before_close attribute in inputs.conf so that the file handle isn't closed too early, which would make me miss logs. But then I came across this problem:
http://answers.splunk.com/answers/224653/why-is-time-before-close-attribute-causing-a-delay.html

0 Karma

Motivator

Do you have something like crcSalt = <SOURCE> in your monitor rule? If so, that's what is causing Splunk to reindex.

See this document:
http://docs.splunk.com/Documentation/Splunk/6.2.2/Data/Howlogfilerotationishandled

I would also recommend posting your config from inputs.conf.

0 Karma

Communicator

Thanks for the response. Following is my inputs.conf

[monitor:///Users/myusername/Desktop/splunk_monitor_logs]
disabled = false
followTail = 0
index = my_index
sourcetype = test_source
crcSalt = test_log
0 Karma

Champion

If you are using crcSalt, you want crcSalt = <SOURCE>. Additionally, you can blacklist using a regex: blacklist = \.gz$

Communicator

Hey,
I'm indexing .gz files as well. So if I add crcSalt = <SOURCE>, then I will get duplicate events, as every .gz file will be identified as a new file.

0 Karma

Motivator

My guess is that it reindexed when you added crcSalt = test_log to your config. Splunk would have seen the file once without the crcSalt and once with it, and grabbed the log twice. Since test_log is a static crcSalt, I expect this behavior won't repeat for any new logs. There's typically no need to specify a static crcSalt unless you are trying to do a one-time reindex of something.

One way you could test this, assuming you have all the source logs and don't mind removing and reindexing: comment out the crcSalt line, then run the steps below, replacing your_index_name_here with the actual index name.
- splunk stop
- splunk clean eventdata -index your_index_name_here
- splunk start

Splunk should index everything available once.

Once the indexing has completed, uncomment the crcSalt = test_log line. I would expect all existing logs to get indexed a second time and new logs to be indexed only once.

On a side note, crcSalt is more often used to force the indexing of files that Splunk skips because two files have an identical file header even though the content following the header is unique. Splunk only checks the first 256 bytes of the file by default, and may incorrectly assume it has already indexed a file in cases where the shared header is longer than 256 bytes.
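The header collision and the effect of a salt can be illustrated with a small Python sketch. It is an analogy only: `zlib.crc32` stands in for whatever checksum Splunk really uses, and `head_crc`, the file contents, and the paths are made up for the example.

```python
import zlib

HEAD_BYTES = 256  # default initial window mentioned above


def head_crc(data: bytes, salt: str = "") -> int:
    """Checksum of the salt plus the first HEAD_BYTES bytes of the file."""
    return zlib.crc32(salt.encode() + data[:HEAD_BYTES])


# Two hypothetical files sharing a 300-byte header but holding different events.
header = b"#" * 300
file_a = header + b"events written by service A\n"
file_b = header + b"events written by service B\n"

# Without a salt, both files look identical to a head-only check.
same_without_salt = head_crc(file_a) == head_crc(file_b)

# With a per-file salt (here the path, as with <SOURCE>), the checksums differ.
same_with_salt = head_crc(file_a, "/logs/a.log") == head_crc(file_b, "/logs/b.log")

print(same_without_salt, same_with_salt)
```

The same mechanism explains the reindexing side effect discussed in this thread: when the salt is the path, a file renamed by rotation gets a different salt and is therefore expected to look like a brand-new file.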

0 Karma

Communicator

I made a better explanation of my situation here.
http://answers.splunk.com/answers/223263/why-is-a-gz-file-created-by-log-rotation-indexed-a.html#com...

As described in the comment linked above, I don't think I even need the crcSalt attribute, as I assume the first 256 bytes of the .log and .log.1.gz are the same, so Splunk should skip indexing that .gz file. I have also explained in the same comment why I had to use crcSalt.

0 Karma