At a high level, these are the steps:
Once the bucket rolls to warm, the "remote_storage_upload_timeout" timer starts on the target peers. On the source, the bucket is registered with the CacheManager, optimized, and uploaded to remote storage, at which point the bucket registration ends. After the upload, the source peer notifies the target peers that the bucket is on remote storage. When the target peers receive that message, they cancel the registration with the CacheManager, mark the bucket as stable, and evict it. Below is a breakdown of these steps with the corresponding log messages:
i) Source peer rolls the bucket
05-15-2019 15:46:44.844 +0000 INFO HotBucketRoller - finished moving hot to warm bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 idx=perfmon from=hot_v1_468 to=db_1557754467_1557734922_468_FA94F613-032D-4C8E-9D04-EFA3F5E923C9 size=397975552 caller=lru maxHotBuckets=10, count=11 hot buckets,evicting_count=1 LRU hot s
ii) Done key received on the target peer (which means replication from the source is complete)
05-15-2019 15:46:44.879 +0000 INFO S2SFileReceiver - event=onDoneReceived replicationType=eJournalReplication bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9
05-15-2019 15:46:44.879 +0000 INFO S2SFileReceiver - about to finalize from close bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9
iii) The target peer starts the "remote_storage_upload_timeout" timer, so that if it does not hear from the source peer before the timer expires, it can start uploading the bucket itself. The target also rolls its copy of the bucket from hot→warm.
INFO CMSlave - bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 added so this target peer can assume responsility of upload later
05-15-2019 15:46:44.884 +0000 INFO S2SFileReceiver - event=rename bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 from=/opt/splunk/var/lib/splunk/perfmon/db/468_FA94F613-032D-4C8E-9D04-EFA3F5E923C9 to=/opt/splunk/var/lib/splunk/perfmon/db/db_1557754467_1557734922_468_FA94F613-032D-4C8E-9D04-EFA3F5E923C9
05-15-2019 15:46:44.884 +0000 INFO CMSlave - bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 Transitioning status from=StreamingTarget to=Complete for reason="hot success (target)"
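The timer behavior in step iii can be sketched roughly as below. This is only an illustrative model, not Splunk's implementation: the class name `UploadTimeoutTracker`, its methods, and the callback are hypothetical, and the real timeout comes from remote_storage_upload_timeout rather than a constructor argument.

```python
import threading


class UploadTimeoutTracker:
    """Hypothetical model of a target peer's per-bucket upload timers."""

    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout  # called with the bid if the timer expires
        self._timers = {}
        self._lock = threading.Lock()

    def add(self, bid):
        """Start the timer when replication of the bucket completes."""
        timer = threading.Timer(self.timeout_s, self._expire, args=(bid,))
        with self._lock:
            self._timers[bid] = timer
        timer.start()
        return timer

    def remove(self, bid):
        """Cancel the timer, e.g. when the source reports a successful upload.

        Returns True if a pending timer was found and cancelled.
        """
        with self._lock:
            timer = self._timers.pop(bid, None)
        if timer is not None:
            timer.cancel()
        return timer is not None

    def _expire(self, bid):
        # Only take over the upload if nobody cancelled the timer meanwhile.
        with self._lock:
            still_tracked = self._timers.pop(bid, None) is not None
        if still_tracked:
            self.on_timeout(bid)
```

In this sketch, `add` corresponds to the "added so this target peer can assume responsility of upload later" message, and `on_timeout` stands in for whatever the target does when it has to take over the upload.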
iv) Meanwhile, the source starts uploading the bucket, since it has finished the optimize/repair process for the bucket. It also saves the state of the files in the bucket directory locally by writing to the file "cachemanager_local.json".
05-15-2019 15:46:54.718 +0000 INFO DatabaseDirectoryManager - cid="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|" uploading the bucket to remote storage since optimize/repair process has completed successfully
05-15-2019 15:46:54.723 +0000 INFO CacheManager - action=upload, cacheId="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|", status=attempting
05-15-2019 15:47:00.786 +0000 INFO CacheManager - action=upload, cacheId="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|", status=succeeded, elapsed_ms=6063
Corresponding entries in audit.log for the bucket upload:
05-15-2019 15:46:54.723 +0000 INFO AuditLogger - Audit:[timestamp=05-15-2019 15:46:54.723, user=n/a, action=local_bucket_upload, info=started, cache_id="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|", prefix=reedexpo/perfmon/db/4d/e7/468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9/guidSplunk-FA94F613-032D-4C8E-9D04-EFA3F5E923C9][n/a]
05-15-2019 15:47:00.817 +0000 INFO AuditLogger - Audit:[timestamp=05-15-2019 15:47:00.817, user=n/a, action=local_bucket_upload, info=completed, cache_id="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|", local_dir="/opt/splunk/var/lib/splunk/perfmon/db/db_1557754467_1557734922_468_FA94F613-032D-4C8E-9D04-EFA3F5E923C9", kb=382940, elapsed_ms=6095][n/a]
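When checking many buckets, it can help to pair the CacheManager "attempting" and "succeeded" lines programmatically. The following is a minimal sketch that assumes the splunkd.log line layout shown above; the function names are hypothetical, and the regular expression has only been matched against the sample lines in this post.

```python
import re
from datetime import datetime, timedelta

# Matches CacheManager upload lines in the format shown above.
UPLOAD_RE = re.compile(
    r'^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}) INFO\s+CacheManager - '
    r'action=upload, cacheId="(?P<cache_id>[^"]+)", status=(?P<status>\w+)'
    r'(?:, elapsed_ms=(?P<elapsed_ms>\d+))?'
)


def parse_upload_event(line):
    """Return a dict for a CacheManager upload line, or None for other lines."""
    m = UPLOAD_RE.match(line)
    if m is None:
        return None
    elapsed = m.group("elapsed_ms")
    return {
        "ts": datetime.strptime(m.group("ts"), "%m-%d-%Y %H:%M:%S.%f %z"),
        "cache_id": m.group("cache_id"),
        "status": m.group("status"),
        "elapsed_ms": int(elapsed) if elapsed is not None else None,
    }


def upload_durations_ms(lines):
    """Map cacheId -> (timestamp difference in ms, elapsed_ms reported in the log)."""
    started = {}
    durations = {}
    for line in lines:
        event = parse_upload_event(line)
        if event is None:
            continue
        cache_id = event["cache_id"]
        if event["status"] == "attempting":
            started[cache_id] = event["ts"]
        elif event["status"] == "succeeded" and cache_id in started:
            delta = event["ts"] - started.pop(cache_id)
            durations[cache_id] = (delta // timedelta(milliseconds=1), event["elapsed_ms"])
    return durations
```

For the two CacheManager lines above, the timestamp difference is 6063 ms, which matches the logged elapsed_ms=6063.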
NOTE: "cachemanager_local.json" is a local file that resides in the db directory for warm buckets. It tracks which files are present on the local disk. The file is updated when the bucket is about to be uploaded, when bucket contents are downloaded because a search opens the bucket, or when an upload is cancelled.
The contents of this file look something like this:
v) Source peer reports the upload status to the replication target(s)
05-15-2019 15:47:00.817 +0000 INFO CMSlave - bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 upload status being reported to the replicated targets
05-15-2019 15:47:00.818 +0000 INFO CMRepJob - running job=CMReportBucketInStableStorageJob bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 ot_guid=5B2CABAA-22E8-4B25-AE31-C089D69FE13D ot_hp=INDEXER:8089
vi) The target peer(s) receive the notification from the source peer, confirm that the bucket is present on remote storage, and update their metadata with the remote storage metadata.
05-15-2019 15:47:00.844 +0000 INFO CMSlave - bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 reported to be on remote storage by upload peer, will confirm it is present by checking the remote storage
05-15-2019 15:47:00.871 +0000 INFO DatabaseDirectoryManager - cid="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|" found to be on remote storage
05-15-2019 15:47:00.871 +0000 INFO IndexerIf - Asked to update bucket manifest values, bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9
05-15-2019 15:47:00.903 +0000 INFO DatabaseDirectoryManager - idx=perfmon Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/perfmon/db', pendingBucketUpdates=0 . Reason='Updated metadata of bucket with remote storage metadata, bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9'
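Because the messages for one bucket are spread across the source and target peers, it can be easier to follow the steps above by merging the lines for a single bid into one timeline. Below is a rough sketch of that idea. It assumes the splunkd.log line layout shown in this post; `bucket_timeline` and the peer labels are hypothetical names, and lines without a leading timestamp are simply skipped.

```python
import re
from datetime import datetime

# Leading "timestamp level component - " part of a splunkd.log line.
PREFIX_RE = re.compile(
    r'^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}) (\w+)\s+(\w+) - '
)


def bucket_timeline(lines_by_peer, bid):
    """Return (timestamp, peer, component) tuples for lines mentioning bid, oldest first.

    lines_by_peer maps a label such as "source" or "target" to an iterable of
    log lines collected from that peer.
    """
    events = []
    for peer, lines in lines_by_peer.items():
        for line in lines:
            m = PREFIX_RE.match(line)
            if m is None or bid not in line:
                continue
            ts = datetime.strptime(m.group(1), "%m-%d-%Y %H:%M:%S.%f %z")
            events.append((ts, peer, m.group(3)))
    return sorted(events)


def print_timeline(lines_by_peer, bid):
    for ts, peer, component in bucket_timeline(lines_by_peer, bid):
        print(ts.strftime("%H:%M:%S.%f")[:-3], peer, component)
```

The substring match on the bid also picks up lines that reference the bucket as cacheId="bid|...|" or cid="bid|...|", since those values contain the bid.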
NOTE: In recent versions, this step (the source notifying the targets that it has uploaded the bucket) is optional and is turned off by default. In that case, targets do not receive a notification from the source. Instead, each target checks remote storage for the bucket after remote_storage_upload_timeout expires and, if the bucket is present, marks it stable as part of a cancelled upload. The following setting was introduced to make this feature optional:
report_remote_storage_bucket_upload_to_targets = <boolean>
* Only valid for 'mode=slave'.
* For a remote storage enabled index, this attribute specifies whether
  the source peer reports the successful bucket upload to target peers.
  This notification is used by target peers to cancel their upload timers
  and synchronize their bucket state with the uploaded bucket on remote
  storage.
* Do not change the value from the default unless instructed by
  Splunk Support.
* Default: false
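To see whether a given configuration file overrides this default, a quick check along the following lines may be enough. This is a sketch under assumptions: it assumes the setting sits in the [clustering] stanza of server.conf (suggested by the 'mode=slave' reference, but not stated in the excerpt above), it looks at a single file's text only, and it does not reproduce Splunk's configuration layering. The function name is hypothetical.

```python
import configparser

SETTING = "report_remote_storage_bucket_upload_to_targets"


def reports_upload_to_targets(conf_text, stanza="clustering"):
    """Return the setting's value in conf_text, or False (the default) if unset.

    The stanza name is an assumption; pass a different one if your
    deployment defines the setting elsewhere.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getboolean(stanza, SETTING, fallback=False)


if __name__ == "__main__":
    sample = "[clustering]\nmode = slave\n"
    # The setting is absent here, so the documented default applies.
    print(reports_upload_to_targets(sample))
```

Note that Python's configparser rejects files with settings before the first stanza header, so this only works on files where every setting sits under a stanza.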
vii) The target peer now cancels the registration of the bucket with the CacheManager, marks the bucket as stable, and then evicts the bucket locally.
05-15-2019 15:47:00.905 +0000 INFO CMSlave - bid=perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9 removed from the replicatedBucketsUploadTimeout map
05-15-2019 15:47:00.912 +0000 INFO CacheManager - cancel registering new cacheId="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|" for search sid=bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|
05-15-2019 15:47:00.912 +0000 INFO CacheManager - Making cacheId="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|" stable as part of cancelled upload
The corresponding audit.log entry, which logs the eviction of the bucket:
05-15-2019 15:47:00.952 +0000 INFO AuditLogger - Audit:[timestamp=05-15-2019 15:47:00.952, user=splunk-system-user, action=local_bucket_evict, info=completed, cache_id="bid|perfmon~468~FA94F613-032D-4C8E-9D04-EFA3F5E923C9|", kb=389622, elapsed_ms=15, files="strings_data,sourcetypes_data,sources_data,hosts_data,lex,tsidx,bloomfilter,journal_gz,deletes,other"][n/
Note the "files" evicted in the above log entry. If all the files in the bucket directory are being evicted, that usually means the target is evicting the bucket because the source or some other peer has already uploaded it. You will encounter other local_bucket_evict entries in audit.log that list different "files" to be evicted, which can be due to other reasons covered later in this page (most commonly "deletes" files, which are evicted due to primary changes).
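To separate full evictions from partial ones when scanning audit.log, the "files" field can be extracted and compared against the full list. The sketch below assumes the ten file types in the sample entry above make up the complete set, which this post does not confirm; the function names are hypothetical.

```python
import re

# Assumption: the file types listed in the sample full-eviction entry above
# are the complete set for a bucket.
ALL_FILE_TYPES = frozenset({
    "strings_data", "sourcetypes_data", "sources_data", "hosts_data", "lex",
    "tsidx", "bloomfilter", "journal_gz", "deletes", "other",
})

EVICT_RE = re.compile(r'action=local_bucket_evict,.*?files="(?P<files>[^"]*)"')


def evicted_files(audit_line):
    """Return the set of evicted file types, or None if this is not an evict entry."""
    m = EVICT_RE.search(audit_line)
    if m is None:
        return None
    return {name for name in m.group("files").split(",") if name}


def is_full_eviction(audit_line):
    """True if the entry evicts every file type in ALL_FILE_TYPES."""
    files = evicted_files(audit_line)
    return files is not None and files >= ALL_FILE_TYPES
```

An entry with only files="deletes" would come back as a partial eviction, which fits the "primary changes" case mentioned above.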