In my use case the source log (tailed by a monitor input stanza) is archived once a day at midnight; the resulting archive file is picked up by the same input stanza and the original source log is then deleted. What I noticed is: if the Splunk instance monitoring that source goes down while new events are still being written to the source log, and it only comes back up after the original file has been archived and the source log deleted, then the ArchiveProcessor does not check whether the archive contains any new, unread events that the TailReader could not read while the instance was down. Please see the following example:
02-05-2020 12:53:00.442 +0000 INFO ArchiveProcessor - Handling file=/etc/ArchiveFolder/sourcelog5.log.gz
02-05-2020 12:53:00.443 +0000 INFO ArchiveProcessor - reading path=/etc/ArchiveFolder/sourcelog5.log.gz (seek=0 len=784)
02-05-2020 12:53:00.499 +0000 INFO ArchiveProcessor - Archive with path="/etc/ArchiveFolder/sourcelog5.log.gz" was already indexed as a non-archive, skipping.
02-05-2020 12:53:00.499 +0000 INFO ArchiveProcessor - Finished processing file '/etc/ArchiveFolder/sourcelog5.log.gz', removing from stats
02-05-2020 13:01:31.503 +0000 INFO WatchedFile - Will begin reading at offset=12392 for file='/etc/ArchiveFolder/sourcelog5.log.gz'.
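For context, a monitor stanza of the kind described above would look roughly like this (the whitelist, sourcetype and index values are placeholders for illustration, not the actual configuration):

[monitor:///etc/ArchiveFolder]
# picks up both the plain source log and the daily .gz archive in the same folder
whitelist = \.log(\.gz)?$
sourcetype = my_sourcetype
index = main
disabled = false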
Based on the documentation:
https://docs.splunk.com/Documentation/Splunk/8.0.2/Data/Howlogfilerotationishandled
I would understand that both the Tailing and the Archive processor should behave the same, but apparently that is not the case here. I also did the complementary test and extracted the source log from the archive again; at that point the TailingProcessor realises that there are still new unread events and starts ingesting them. Why is the ArchiveProcessor missing those new unread events?
Even though the official docs are not very explicit on this aspect, the TailingProcessor (for plain log files) and the ArchiveProcessor (for archived logs) have slightly different implementations: as the log excerpt above suggests, once an archive's content has already been (partially) indexed as a non-archive, the ArchiveProcessor appears to skip the whole file rather than resuming from the last read offset the way the TailingProcessor does.
These are the possibilities I currently see to move this forward (I have tried to summarise them in three macro categories):
A. change the source log management logic, for example:
A.1. instead of archiving the older logs, simply rotate or rename them as plain log files, so that the ArchiveProcessor component is avoided altogether.
A.2. read only archive files and never plain log files, so that every archive is always seen as "new" and ingested in full by the ArchiveProcessor, with no chance of missing events.
A.3. extend the log management logic so that it is aware of an ongoing Splunk outage (including its time window): for example, archives generated during an outage could be extracted again as soon as Splunk is back up, so that the TailingProcessor can finish reading the new events the ArchiveProcessor had skipped (see the sketch after this list).
A.4. I am sure there are other variants that could change the source log management on the UF servers in question.
B. a custom input implementation, for example a scripted or modular input.
C. a Splunk Idea (the new portal for ERs and FRs) to be raised for future evaluation/implementation based on PM prioritisation.
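To illustrate A.3, here is a minimal sketch (not a tested implementation) of the kind of post-outage step the log management logic could run: it assumes a marker file that records when Splunk was last seen up, and re-extracts any archive created after that timestamp so the TailingProcessor can pick up the remaining events. The paths, the marker-file convention and the naming scheme are all assumptions for illustration.

#!/usr/bin/env python3
# Sketch for option A.3: after Splunk comes back up, re-extract any archive
# created during the outage so the TailingProcessor can read the events the
# ArchiveProcessor skipped. Paths and the marker-file convention are examples.
import gzip
import os
import shutil

ARCHIVE_DIR = "/etc/ArchiveFolder"           # where the daily .gz archives land
MARKER_FILE = "/var/run/splunk_last_seen"    # assumed to be touched periodically while Splunk is up

def last_seen():
    """Return the mtime of the marker file, or 0 if it does not exist yet."""
    try:
        return os.path.getmtime(MARKER_FILE)
    except OSError:
        return 0.0

def reextract_outage_archives():
    cutoff = last_seen()
    for name in os.listdir(ARCHIVE_DIR):
        if not name.endswith(".log.gz"):
            continue
        gz_path = os.path.join(ARCHIVE_DIR, name)
        # Only archives created after Splunk was last seen up are candidates:
        # those are the ones the ArchiveProcessor may have skipped.
        if os.path.getmtime(gz_path) <= cutoff:
            continue
        plain_path = gz_path[:-3]  # sourcelog5.log.gz -> sourcelog5.log
        if os.path.exists(plain_path):
            continue  # already re-extracted
        with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # The TailingProcessor can now resume the plain file from its last
        # recorded offset and ingest the remaining events.

if __name__ == "__main__":
    reextract_outage_archives()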