
Why is a .gz file created by log rotation indexed again by Splunk?

nibinabr
Communicator

I have a log file called test_logs.log. Once it hits a specific size, it is rotated to create test_logs.log.1.gz. I monitor the .log file and all the compressed .gz files. I noticed that test_logs.log.1.gz (which contains the same data that was in test_logs.log) is also indexed by Splunk.

Won't the .log file and the .log.1.gz file have the same crcSalt, since they contain the same data?

Edit: To make this clear, I'm not writing anything to the .gz files. Logs are written only to the .log files.
Thanks

sanchitguptaiit
Explorer

I can confirm that Splunk is not handling compressed, rotated log files correctly.

I copied the current log to a new file with a .2s suffix:

cp ptpe-service.log ptpe-service.2s.log

Splunk correctly matched the copy with the current log and set the seek offset accordingly:

06-27-2016 16:54:45.871 -0700 DEBUG WatchedFile - seeking /base/logs/web/ptpe/ptpe-service.2s.log to off=192683

But when I compressed the same file (gzip ptpe-service.2s.log), Splunk couldn’t match it with the current log and read it from the start:

INFO ArchiveProcessor - reading path=/base/logs/web/ptpe/ptpe-service.2s.log.gz (seek=0 len=16227)

It looks like Splunk doesn’t handle compressed files well.

I also tried copying a .gz file to another .gz file; that copy wasn't reindexed, though.
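The comparison in this experiment can be approximated outside Splunk. Below is a minimal sketch, assuming the duplicate check is based on a checksum of the first 256 bytes of the decompressed content, as described elsewhere in this thread. `zlib.crc32` is only a stand-in for whatever checksum Splunk uses internally, and `head_crc` is a hypothetical helper name.

```python
import gzip
import zlib

HEAD_BYTES = 256  # size of the initial chunk, as quoted in this thread


def head_crc(path, nbytes=HEAD_BYTES):
    """CRC32 of the first nbytes of a file; .gz files are decompressed first."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        return zlib.crc32(fh.read(nbytes))
```

If `head_crc("ptpe-service.log")` and `head_crc("ptpe-service.2s.log.gz")` agree, the two files start with the same content, so any reindexing would have to come from something other than a differing file head.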


dwaddle
SplunkTrust

Everyone else's answers are great, but to add just a little more info:

First of all, crcSalt really only supports one option, and that is crcSalt = <SOURCE>. If you set crcSalt = <SOURCE>, you should expect files with the same content but different names to get reindexed over and over. This is usually NOT what you want, so using crcSalt at all must come with a grain of salt... sorry for the bad joke.

Second, race conditions. Splunk is usually smart enough to look inside compressed files, see whether they are equivalent to a file it has already processed in uncompressed form, and skip them. But, depending on how your log compressor and rotator work, there could be a race condition where Splunk sees the new xxx.log.gz while it is still being built. While the file is still being built, Splunk will compute the initial CRC of the first 256 bytes of the file and check it against the fishbucket. When an initial CRC is matched, but the 'new' file is smaller than the first file Splunk saw with that CRC, Splunk will decide to process the new file ... creating duplicate events.
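The decision described above can be written down as a toy model. This is only a sketch of the behaviour as explained in this post, not Splunk's actual implementation: `classify` and the `seen` mapping (initial CRC to bytes already read) are hypothetical names, and the real fishbucket tracks more state than this.

```python
def classify(seen, head_crc, size):
    """Decide what to do with a file, given the files seen so far.

    seen maps an initial-chunk CRC to the number of bytes already read
    from the file that first produced that CRC.
    """
    if head_crc not in seen:
        return "index from start"   # content never seen before
    already_read = seen[head_crc]
    if size < already_read:
        # Same head, but shorter than what was already read: treated as a
        # different file. A half-written .gz can land in this branch.
        return "index from start"
    if size == already_read:
        return "skip"               # nothing new to read
    return "resume at offset %d" % already_read
```

In this model, a .gz that is caught while the compressor is still writing it looks like a shorter file with a known head, which is the branch that produces duplicates.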

You should make sure that your compressor / rotator tools always create a temporary file and then when the compression is done, rename that file to a .gz. This way, Splunk and the compressor never get into a race condition where the state of the compressed file is in flux.
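A minimal sketch of the temporary-file-then-rename approach follows. The function name, the `.tmp` suffix, and the `tmp_dir` parameter are illustrative assumptions, not part of any particular rotation tool.

```python
import gzip
import os
import shutil


def compress_atomically(src, dst=None, tmp_dir=None):
    """Gzip src into dst via a temporary file, then rename it into place."""
    dst = dst or src + ".gz"
    # Hypothetical convention: write to "<name>.gz.tmp". The temporary file
    # must be on the same filesystem as dst for the rename to be atomic.
    tmp_dir = tmp_dir or os.path.dirname(dst)
    tmp = os.path.join(tmp_dir, os.path.basename(dst) + ".tmp")
    with open(src, "rb") as fin, gzip.open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    os.replace(tmp, dst)   # the complete .gz appears under its final name in one step
    os.remove(src)
    return dst
```

If the whole directory is monitored, the temporary file would also need to be blacklisted or written to an unmonitored directory on the same filesystem, otherwise the race simply moves to the .tmp file.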

Also, as suggested above, blacklisting compressed files is a great way to avoid this entirely.

Note: the setting referred to above is crcSalt = <SOURCE>, written with literal less-than and greater-than signs around the word SOURCE. The Answers site tends to eat the angle brackets.

dwaddle
SplunkTrust

Test to see if crcSalt = <SOURCE> is supported in comments.


nibinabr
Communicator

That was interesting. I looked into the way my log rotation works.
1) test_log.log is created when logging starts.
2) Once it hits the max size, it is rotated to test_log.log.1.
3) test_log.log.1 is now compressed to test_log.log.1.gz.
4) The test_log.log.1.gz that already existed is renamed to test_log.log.2.gz.
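The sequence above can be sketched roughly as follows. This is an assumed reconstruction, not the actual logger's code: the function name and the `keep` parameter are hypothetical, and older archives are shifted first so that nothing is overwritten. Note that the compression step writes the .gz in place, so the file is visible under its final name while it is still incomplete.

```python
import gzip
import os
import shutil


def rotate(log_path, keep=4):
    """Shift N.gz to N+1.gz, then compress the live log into .1.gz."""
    # Shift older archives from highest to lowest number.
    for n in range(keep - 1, 0, -1):
        older = "%s.%d.gz" % (log_path, n)
        if os.path.exists(older):
            os.replace(older, "%s.%d.gz" % (log_path, n + 1))
    rotated = log_path + ".1"
    os.replace(log_path, rotated)        # test_log.log -> test_log.log.1
    with open(rotated, "rb") as fin, gzip.open(rotated + ".gz", "wb") as fout:
        shutil.copyfileobj(fin, fout)    # .1.gz is readable while still partial
    os.remove(rotated)
    open(log_path, "ab").close()         # start a fresh, empty live log
```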

I can see how Splunk might look at the .gz file while it is being created and re-index it. As far as I understand, that should only be a problem between test_log.log and test_log.log.1.gz, as that is the only place where compression happens. The other files are just renamed.
To test this, I blacklisted the 1.gz file. This time I see test_log.log and test_log.log.2.gz indexed, causing duplicates. test_log.log.3.gz and test_log.log.4.gz are not indexed.


masonmorales
Influencer

Since your .gz files are just rotated logs, and Splunk is actively indexing your .log files, there is really no advantage to having Splunk index the logs again once they've been rotated into .gz format. In fact, doing so is likely counting against your license. I would recommend that you blacklist the .gz files and whitelist the .log files in your inputs.conf. Here's an example of how to do so with your stanza:

[monitor:///Users/myusername/Desktop/splunk_monitor_logs]
whitelist = \.log$
blacklist = \.gz$
disabled = false
followTail = 0
index = my_index
sourcetype = test_source
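The effect of the two filters can be checked against sample file names before touching inputs.conf. This is a rough sketch using Python's `re` module as a stand-in for Splunk's regex engine; `is_monitored` is a hypothetical helper, and it assumes the blacklist takes precedence when a path matches both filters.

```python
import re


def is_monitored(path, whitelist=r"\.log$", blacklist=r"\.gz$"):
    """Return True if path passes the whitelist and is not blacklisted."""
    if blacklist and re.search(blacklist, path):
        return False                      # blacklist wins
    if whitelist:
        return bool(re.search(whitelist, path))
    return True


for name in ("test_log.log", "test_log.log.1.gz", "summary_gz"):
    print(name, is_monitored(name, whitelist=None))
```

Escaping the dot matters: an unescaped `.gz$` treats the dot as "any character" and would also exclude a name such as summary_gz, while `\.gz$` only matches a literal .gz extension.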

nibinabr
Communicator

Thanks for the response. I have explained the reason why I'm indexing the gz logs here http://answers.splunk.com/comments/223267/view.html.

Based on my understanding (please correct me if I'm wrong), before indexing a file Splunk looks at the first 256 bytes to see whether it has already been indexed. In my use case, most of the time it won't have to index the .gz files, as they will already have been indexed. I'm doing this just to make sure that I don't lose any logs.


somesoni2
Revered Legend

I believe you're using crcSalt for the wrong purpose here. See the full description of the crcSalt attribute here:
http://docs.splunk.com/Documentation/Splunk/6.1.2/admin/Inputsconf

I guess you wanted to provide the file name here so that all other files would be filtered out. For that, just try this (no crcSalt):

[monitor:///Users/myusername/Desktop/splunk_monitor_logs/test_log.log]
disabled = false
followTail = 0
index = my_index
sourcetype = test_source

nibinabr
Communicator

You are right, I asked the wrong question. Even if I don't add crcSalt to this stanza, it shouldn't index the .gz files, as I would assume the first 256 bytes of the .log and the .log.1.gz are the same. There are a couple of things I think I was not clear about in the question.
1) I have to monitor the compressed (.gz) logs as well. This is because a huge chunk of data gets logged to test_logs.log, and at times test_logs.log gets rotated before those events are indexed. Looking into all the compressed files will largely prevent missing logs.
2) Why did I include the crcSalt here?
I have another log file called new_log.log (under another monitor) which has the same first 256 bytes as test_log.log. I added crcSalt = new_log for that monitor and crcSalt = test_log in the above monitor, as I don't want to miss either of those logs; Splunk could otherwise skip indexing one of the files because of the identical first 256 bytes.

Even after removing the crcSalt I still see the same issue.


bandit
Motivator

To summarize: due to the high logging rate, it sounds like the logs are rotating before Splunk gets to EOF?
You may want to check out this post for more info: http://answers.splunk.com/answers/58549/high-volume-log-rotation.html


nibinabr
Communicator

I came across that post when I was looking for a solution to missing logs caused by the high rate of logging.
The solution discussed there was to use the time_before_close attribute. My logger doesn't produce logs continuously. It writes a large chunk of data to the log files every 10 minutes, so I would have to set time_before_close to a value slightly over 10 minutes to get good results. There are also other use cases where I may end up losing logs even after setting this attribute. That was the reason I decided to index the compressed files as well.


bandit
Motivator

Are you able to control the rotation of your log files so that they are not rotated as quickly, e.g., by allowing them to grow to 1 GB? It seems preferable to have Splunk caught up to EOF before the rotation.


nibinabr
Communicator

I was planning to increase the log file size if I cannot find a solution to this problem, even though it is not my preferred solution. As mentioned in a previous comment, I don't have a continuous flow of data into the log file. Logging happens only once every 5 minutes. So I decided to set the time_before_close attribute in inputs.conf so that the file handle isn't closed early, which would cause me to miss logs. But then I came across this problem:
http://answers.splunk.com/answers/224653/why-is-time-before-close-attribute-causing-a-delay.html


bandit
Motivator

Do you have something like crcSalt = <SOURCE> in your monitor rule? If so, that's what is causing Splunk to reindex.

See this document:
http://docs.splunk.com/Documentation/Splunk/6.2.2/Data/Howlogfilerotationishandled

I would also recommend posting your config from inputs.conf.


nibinabr
Communicator

Thanks for the response. The following is my inputs.conf:

[monitor:///Users/myusername/Desktop/splunk_monitor_logs]
disabled = false
followTail = 0
index = my_index
sourcetype = test_source
crcSalt = test_log

bmacias84
Champion

If you are using crcSalt, you want crcSalt = <SOURCE>. Additionally, you can blacklist using a regex: blacklist = \.gz$

nibinabr
Communicator

Hey,
I'm indexing the .gz files as well. So if I add crcSalt = <SOURCE>, I will get duplicate events, as every .gz file will be identified as a new file.


bandit
Motivator

My guess is that it reindexed when you added crcSalt = test_log to your config. Splunk would have seen the file once without the crcSalt and once with it, and so grabbed the log twice. Since test_log is a static crcSalt, I expect this behavior won't repeat for any new logs. There's typically no need to specify a static crcSalt unless you are trying to do a one-time reindex of something.

Here is one way you could test this, assuming you have all the source logs and don't mind removing and reindexing the data. Comment out the crcSalt line, then run the following, replacing yourindexnamehere with the actual index name:
- splunk stop
- splunk clean eventdata yourindexnamehere
- splunk start

Splunk should index everything available once.

Once the indexing has completed, uncomment the crcSalt = test_log line. I would expect all existing logs to get indexed a second time and new logs to be indexed only once.

On a side note, crcSalt is more often used to force the indexing of files that Splunk skips because two files have an identical file header, even though the content following the header is unique. By default Splunk only checks the first 256 bytes of the file, so it may incorrectly assume it has already indexed a file in cases where the header is longer than 256 bytes.
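To illustrate why a salt changes the outcome, here is a minimal sketch. It assumes the salt is simply mixed into the checksum of the first 256 bytes; `zlib.crc32` and the helper name `initial_crc` are stand-ins, not Splunk's actual hashing.

```python
import zlib


def initial_crc(head, salt=b""):
    """CRC32 over an optional salt followed by the file's first bytes."""
    return zlib.crc32(salt + head)


# Two different files that share an identical 256-byte header.
header = b"#Fields: a b c\n" * 20
head_one = header[:256]
head_two = header[:256]

# Without a salt the two files are indistinguishable by their head.
print(initial_crc(head_one) == initial_crc(head_two))

# Salting with the source path, in the spirit of crcSalt = <SOURCE>,
# separates them, but also makes a renamed copy of one file look new.
print(initial_crc(head_one, b"test_log.log.1.gz") ==
      initial_crc(head_one, b"test_log.log.2.gz"))
```

The second comparison is the reason a path-based salt causes rotated files to be reindexed: the content is the same, but the salted checksum changes with every rename.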


nibinabr
Communicator

I gave a better explanation of my situation here:
http://answers.splunk.com/answers/223263/why-is-a-gz-file-created-by-log-rotation-indexed-a.html#com...

As described in the comment linked above, I don't think I even need the crcSalt attribute, as I assume the first 256 bytes of the .log and the .log.1.gz are the same, so Splunk should skip indexing that .gz file. I have also explained in the same comment why I had to use crcSalt.
