In my use case, my source log (tailed by a monitor input stanza) is archived once a day at midnight; the resulting archive file is tailed by the same input stanza, and the original source log is then deleted. What I noticed is: if the Splunk instance monitoring that source goes down while new events are still being written to the source log, and it only comes back up after the original file has been archived and the source log deleted, then the ArchiveProcessor does not check whether the archive contains new unread events that the TailReader could not read (because the Splunk instance was down at the time). Please check the following example:
02-05-2020 12:53:00.442 +0000 INFO ArchiveProcessor - Handling file=/etc/ArchiveFolder/sourcelog5.log.gz
02-05-2020 12:53:00.443 +0000 INFO ArchiveProcessor - reading path=/etc/ArchiveFolder/sourcelog5.log.gz (seek=0 len=784)
02-05-2020 12:53:00.499 +0000 INFO ArchiveProcessor - Archive with path="/etc/ArchiveFolder/sourcelog5.log.gz" was already indexed as a non-archive, skipping.
02-05-2020 12:53:00.499 +0000 INFO ArchiveProcessor - Finished processing file '/etc/ArchiveFolder/sourcelog5.log.gz', removing from stats
02-05-2020 13:01:31.503 +0000 INFO WatchedFile - Will begin reading at offset=12392 for file='/etc/ArchiveFolder/sourcelog5.log.gz'.
Based on the documentation:
I would understand that both the Tailing and the Archiving processor should behave the same, but apparently that is not the case here. I also ran the complementary test: I extracted the source log from the archive again, and at that point the Tailing processor realises that there are indeed still new unread events and starts ingesting them. Why is the Archiving processor missing those new unread events?
Even though the official docs are not very explicit on this aspect, TailingProcessor (for plain log files) and ArchiveProcessor (for archived logs) have slightly different implementations:
For the time being, these are the possibilities I currently see to move this forward, summarised in three macro categories:
A. change the sourcelog management logic, for example:
A.1. instead of archiving the older logs, simply rotate or rename them as plain log files, so that the ArchiveProcessor component is avoided altogether.
A.2. read only archive files and no plain log files, so that every archive file is always seen as "new" and is read and ingested in full by the ArchiveProcessor, with no chance of missing events.
A.3. extend the log management logic so that it becomes aware of an ongoing Splunk outage, including its time window. For example, archives generated during a Splunk outage could be extracted again immediately after Splunk is back, so that the TailingProcessor can finish reading all the new events which the ArchiveProcessor had ignored/skipped earlier.
A.4. there are surely other variants of this approach that would change the sourcelog management on the UF servers in question.
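As a rough illustration of option A.3, the following sketch re-extracts any archive created after Splunk was last known to be up, so the TailingProcessor can pick up the skipped events as plain log files. All paths, the heartbeat mechanism, and the function name are my own assumptions, not anything Splunk provides:

```python
import gzip
import shutil
from pathlib import Path

def reextract_missed_archives(archive_dir: Path, restore_dir: Path,
                              last_up: float) -> list:
    """Re-extract every .gz archive whose mtime is newer than last_up
    (a timestamp of when Splunk was last known to be running), so the
    TailingProcessor can read the events the ArchiveProcessor skipped.
    Sketch of option A.3 only; adapt to your own rotation scheme."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = []
    for archive in sorted(archive_dir.glob("*.gz")):
        if archive.stat().st_mtime > last_up:
            # sourcelog5.log.gz -> restored/sourcelog5.log
            target = restore_dir / archive.stem
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            restored.append(target)
    return restored

if __name__ == "__main__":
    # Hypothetical setup: a cron job touches this heartbeat file while
    # Splunk is up, so its mtime approximates the start of the outage.
    heartbeat = Path("/var/run/splunk_heartbeat")
    last_up = heartbeat.stat().st_mtime if heartbeat.exists() else 0.0
    reextract_missed_archives(Path("/etc/ArchiveFolder"),
                              Path("/etc/ArchiveFolder/restored"), last_up)
```

The re-extracted files would need to match the existing monitor stanza's whitelist so the TailingProcessor picks them up.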
B. a custom input implementation, for example a custom scripted input or a modular input.
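For option B, a scripted input is just a script whose stdout Splunk indexes, so the script itself must track which archives it has already emitted. A minimal sketch, assuming a hypothetical JSON checkpoint file (the paths and function name are illustrative, not a Splunk API):

```python
import gzip
import json
import sys
from pathlib import Path

def emit_new_archives(archive_dir: Path, checkpoint: Path,
                      out=sys.stdout) -> int:
    """Write to stdout (which Splunk indexes for a scripted input) the
    full contents of any archive not yet recorded in the checkpoint
    file, then update the checkpoint. Sketch of option B only."""
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    emitted = 0
    for archive in sorted(archive_dir.glob("*.gz")):
        if archive.name in seen:
            continue  # already ingested on a previous run
        with gzip.open(archive, "rt") as src:
            for line in src:
                out.write(line)
                emitted += 1
        seen.add(archive.name)
    checkpoint.write_text(json.dumps(sorted(seen)))
    return emitted

if __name__ == "__main__":
    # Hypothetical locations; a real deployment would keep the
    # checkpoint somewhere persistent under $SPLUNK_HOME.
    emit_new_archives(Path("/etc/ArchiveFolder"),
                      Path("/opt/splunk/var/archive_input_checkpoint.json"))
```

Because the checkpoint is keyed on the whole archive, each archive is always ingested in full, which sidesteps the partial-read problem at the cost of handling deduplication yourself if an archive is ever regenerated under the same name.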
C. a Splunk Idea (the new portal for ERs and FRs) to be raised for future evaluation/implementation, subject to PM prioritisation.