How to avoid indexing events twice when applying crcSalt=

Explorer

Hello

We are indexing a file structure like /opt/logs////.
with YYYY=year, MM=month and DD=day.
So far, we have not been using crcSalt but we now have to apply crcSalt= for some of the smaller file types.

We currently have only one monitor stanza for /opt/logs and I do not want to switch on crcSalt= for all files.
My plan is to exclude the file names in question from the /opt/logs stanza's whitelist, create a new stanza like /opt/logs////(|), add the relevant files to that stanza's whitelist, and set crcSalt= there.

But how do I keep events from being indexed twice when I switch on crcSalt for these files?
Files that are already indexed will no longer be recognised as indexed, because the CRC calculation changes.

Thanks.

1 Solution

Esteemed Legend

The only reason to muck with crcSalt is that files are being skipped: they fail the CRC check because they are too similar, so Splunk thinks they are identical. Is this your problem? What value are you planning to set; are you planning on <SOURCE>? If so, the only way you will get duplicated data is if something copies a file to another name. Using <SOURCE> effectively disables the CRC check entirely, so Splunk will index any file that shows up.

From the docs:

* If set to the literal string <SOURCE> (including the angle brackets), the
  full directory path to the source file is added to the CRC. This ensures
  that each file being monitored has a unique CRC. When crcSalt is invoked,
  it is usually set to <SOURCE>.
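
As a rough illustration of the setting quoted above, a monitor stanza with the salt enabled might look like the sketch below. The monitored path is a made-up placeholder, not one taken from this thread.

```ini
# inputs.conf (sketch; the monitored path is hypothetical)
[monitor:///var/log/example]
# Literal string, angle brackets included: the full source path is
# added to the CRC, so files that start with the same content but
# live under different paths are treated as different files.
crcSalt = <SOURCE>
```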

Explorer

Thanks for your answer, Woodcock.

That is indeed why we want to use crcSalt: some files are not being indexed, and the log messages suggest that we use crcSalt to deal with it. And yes, we will use the SOURCE value (with angle brackets).

But I think I did not make my point clear. We currently do not use crcSalt. When I added crcSalt for the files in question on our test system, it started to re-index files that were already indexed. That makes sense to me, since the CRC value has changed. So my question is: how do I keep the system from initially re-indexing files that are already indexed and still fall within the ignoreOlderThan time span?
My idea is to set followTail=1 initially and then change it back to 0 once the ignoreOlderThan period has elapsed.
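
The staged approach described here could be sketched roughly as follows. The stanza path, the file name app_small and the 7d value are placeholders for illustration only; the real values depend on the existing configuration.

```ini
# inputs.conf, phase 1 (sketch; path, file name and 7d are assumptions)
[monitor:///opt/logs/*/*/*/app_small.*]
crcSalt = <SOURCE>
ignoreOlderThan = 7d
# Temporary: a file seen for the first time is read from its current
# end, so content indexed before the CRC changed is skipped.
followTail = 1

# Phase 2, once the ignoreOlderThan window has passed:
# remove followTail (or set followTail = 0) and restart Splunk.
```

followTail only matters the first time Splunk sees a file, and the inputs.conf spec advises against leaving it enabled permanently. One likely side effect to watch for: a brand-new file that appears during phase 1 may lose whatever was written to it before Splunk first picked it up.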

Esteemed Legend

Yes, your plan is perfect and exactly what I was going to suggest.

Explorer

Thanks a lot.

Explorer

Sorry, it seems that the post was not rendered quite right.
The file structure is like this: /opt/logs/SERVER/YYYY/MM/FILENAME.YYYYMMDD

The new stanza would contain wildcards in place of SERVER, YYYY and MM and refer only to the file names that need crcSalt.
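
Under that layout, the split might look roughly like the sketch below. The file names small_a and small_b are hypothetical, and the sketch uses a blacklist on the broad stanza instead of editing its whitelist, which is an assumption about the existing config. How Splunk resolves two stanzas that cover the same directory tree is worth confirming on a test system before rollout.

```ini
# inputs.conf (sketch; small_a and small_b are hypothetical file names)

# Existing broad stanza: no crcSalt, small file types excluded.
[monitor:///opt/logs]
blacklist = /(small_a|small_b)\.\d{8}$

# New stanzas for the small file types only.
# One * each for SERVER, YYYY and MM.
[monitor:///opt/logs/*/*/*/small_a.*]
crcSalt = <SOURCE>

[monitor:///opt/logs/*/*/*/small_b.*]
crcSalt = <SOURCE>
```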
