Hi All,
Can someone please explain, in simple terms, what seekaddress and seekcrc are in the context of the CRC check?
I tried to check the documentation, but it looks quite confusing.
I read the scenario below but am still a little confused.
The CRC from the file beginning in the database has no matching record, indicating a file that Splunk hasn’t seen before. Splunk picks it up and ingests its data from the start of the file and updates the database with the new CRCs and Seek Addresses as it ingests the file.
When Splunk is monitoring a file, it regularly re-reads the first 256 bytes (configurable with initCrcLength in inputs.conf) to make sure the file hasn't been rewritten. Those 256 bytes pass through an algorithm to produce a numeric value, called the seekcrc (not unlike a hash). As the file is read, Splunk remembers the current position within the file ("seekaddress") so it can pick up where it left off after a restart.
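To make the two terms concrete, here is a minimal Python sketch. It is not Splunk's code: zlib.crc32 is only a stand-in for whatever checksum Splunk uses internally, and head_crc and read_from are hypothetical helper names made up for this illustration.

```python
import os
import tempfile
import zlib

def head_crc(path, length=256):
    """CRC of the first `length` bytes of the file (stand-in for the seekcrc)."""
    with open(path, "rb") as f:
        return zlib.crc32(f.read(length))

def read_from(path, seek_address):
    """Read everything after `seek_address`; return (data, new seek address)."""
    with open(path, "rb") as f:
        f.seek(seek_address)
        data = f.read()
        return data, f.tell()

# Demo on a throwaway file whose header is longer than 256 bytes.
header = b"app starting\n" * 30  # 390 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(header + b"line 1\n")
    demo_path = tmp.name

crc_before = head_crc(demo_path)
first_chunk, seek_address = read_from(demo_path, 0)

# The application appends a new line to the log.
with open(demo_path, "ab") as f:
    f.write(b"line 2\n")

crc_after = head_crc(demo_path)  # first 256 bytes did not change
second_chunk, seek_address = read_from(demo_path, seek_address)
os.remove(demo_path)
```

Here crc_before and crc_after are equal because the first 256 bytes did not change, and second_chunk contains only the newly appended line. Note that in this sketch a file shorter than 256 bytes would get a different CRC once more data is appended, because the hashed header itself changes.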
See https://www.splunk.com/en_us/blog/tips-and-tricks/what-is-this-fishbucket-thing.html
So when we set CRC = <source> in inputs.conf, what actually happens?
I have created a monitor stanza for one source and it isn't sending logs to Splunk.
When I checked the internal logs, there was a message saying it failed to read the file because it is too short to check the CRC, or something like that.
Firstly, it's <SOURCE>, not <source> (the case of the letters is important here).
Secondly, it means that the file's path is added to the CRC calculation, so even if you have two files with the same header but different paths, they will not be considered the same file by the input. Why would you want that? Because some files can have the same beginning but differ somewhere later (typical use case: an app creates a new file every time it is restarted, and each log file starts with the same report about the app's startup process, like loading libraries and so on).
This option is rarely used but it's there in case you need it.
The crcSalt = <SOURCE> setting adds the full path of the input file to the data used to compute the CRC. It helps prevent duplicate CRCs when different files start with the same content.
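The effect of salting can be sketched in a few lines of Python. This is only an illustration under assumptions: zlib.crc32 stands in for Splunk's checksum, salted_crc is a hypothetical helper, the file paths are made up, and prepending the path to the header is just one possible way of mixing it in.

```python
import zlib
from typing import Optional

def salted_crc(header: bytes, source_path: Optional[str] = None) -> int:
    """CRC of the file header, optionally mixed with the file's path."""
    data = header
    if source_path is not None:
        # Assumption: the path is simply prepended to the hashed bytes.
        data = source_path.encode("utf-8") + header
    return zlib.crc32(data)

# Two log files from separate app runs that start with the same banner.
banner = b"Loading libraries...\nStarting app...\n"
path_a = "/var/log/app/run1.log"  # hypothetical paths
path_b = "/var/log/app/run2.log"

same_without_salt = salted_crc(banner) == salted_crc(banner)
same_with_salt = salted_crc(banner, path_a) == salted_crc(banner, path_b)
```

Without the salt, both files produce the same CRC and the second one would look like a file that has already been seen. With the path mixed in, the CRCs differ, so each file is treated as new. The flip side is that a renamed (rotated) file also gets a new CRC and would be read again from the start.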
Ok. We are a monitor input. We see a new file. It might have been just created, it might have been renamed from another name within the same directory. We don't know that.
First, we check whether the filename is allowed by the whitelist/blacklist combination and the age limit.
If so, we read the beginning of the file and calculate a CRC from the "header" of the file. We check the index of known files, the so-called fishbucket, to see if we already know this CRC.
If we know this CRC, it means we've already seen this file (maybe under another filename), so we look up the remembered position within the file where we last read its contents, and we resume reading from that position.
If we don't, it's a completely new file and we start reading from the beginning.
As we read the file, we update the remembered position stored in the fishbucket, so the next time we encounter a file we can repeat the process.
That's a slightly simplified description of how it works.
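The steps above can be sketched in Python roughly as follows. This is a toy model, not Splunk's implementation: the fishbucket is a plain dict mapping CRC to seek address, zlib.crc32 is a stand-in checksum, check_file is a hypothetical name, and the whitelist/blacklist and age checks are left out.

```python
import os
import tempfile
import zlib

HEADER_LENGTH = 256  # bytes hashed from the start of the file

def check_file(path, fishbucket):
    """Return unread data from `path`, resuming if its header CRC is known."""
    with open(path, "rb") as f:
        crc = zlib.crc32(f.read(HEADER_LENGTH))
        seek_address = fishbucket.get(crc, 0)  # unknown CRC: start at 0
        f.seek(seek_address)
        data = f.read()
        fishbucket[crc] = f.tell()  # remember where we stopped
    return data

fishbucket = {}
header = b"#" * 300 + b"\n"  # longer than HEADER_LENGTH, so the CRC stays stable

with tempfile.TemporaryDirectory() as tmpdir:
    old_path = os.path.join(tmpdir, "app.log")
    new_path = os.path.join(tmpdir, "app.log.1")

    with open(old_path, "wb") as f:
        f.write(header + b"event 1\n")
    first_read = check_file(old_path, fishbucket)  # new CRC: read everything

    with open(old_path, "ab") as f:
        f.write(b"event 2\n")
    os.rename(old_path, new_path)  # simulate log rotation
    second_read = check_file(new_path, fishbucket)  # known CRC: resume
```

After the rename, the file is still recognized by its header CRC, so second_read contains only the new event rather than the whole file again.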