I am trying to migrate data from local storage to a remote store and would like to understand the best way to monitor the progress.
The migration from local storage to a remote store (such as S3) starts when a cluster bundle containing the remote store configuration is deployed from the cluster master to the cluster peers. The migration itself happens on the indexers. During migration, the peers upload all searchable copies to the remote store; when multiple peers upload the same copy of a bucket, only one copy ends up on the remote store.
Once the migration is complete on an indexer, it will not be attempted again (if needed, it can be manually re-triggered). Below are sample searches you can use to examine different aspects of the migration process.
1) Tracing the start of the migration (splunkd.log component DatabaseDirectoryManager logs one entry per index):
SPL
index=_internal source="splunkd.log" DatabaseDirectoryManager "Remote storage migration needed" | timechart count by idx
Sample Event:
11-21-2018 06:38:20.514 +0000 INFO DatabaseDirectoryManager - Remote storage migration needed for idx=main for a bucket count=34
This event has the index name and the count of buckets to be migrated.
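To get a per-index total of buckets awaiting migration rather than a timechart, a small variation can help. This is a sketch that assumes the idx and count key=value pairs auto-extract from the event above, which they normally do:
SPL:
index=_internal source=*splunkd.log component=DatabaseDirectoryManager "Remote storage migration needed" | stats sum(count) AS buckets_to_migrate by idx | addcoltotals labelfield=idx label=TOTAL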
2) Tracking the end of the migration (this event covers all indexes):
SPL
index=_internal source="splunkd.log" component=CacheManager "Remote storage migration" completed
Sample Event:
11-21-2018 06:38:28.957 +0000 INFO CacheManager - Remote storage migration of buckets and summaries completed (duration_sec=8 upload_jobs=67)
Note: you can check that upload_jobs matches the total of the per-index bucket counts from the start-of-migration events; a comparison sketch follows.
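A hedged sketch to line the two numbers up in one result per indexer. It assumes the count and upload_jobs key=value pairs auto-extract from the two events shown above:
SPL:
index=_internal source=*splunkd.log ((component=DatabaseDirectoryManager "Remote storage migration needed") OR (component=CacheManager "Remote storage migration of buckets and summaries completed")) | stats sum(count) AS total_bucket_count sum(upload_jobs) AS total_upload_jobs by host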
3) Here is an SPL search that can also be used to see the progress of the migration, but it has some limitations:
| rest /services/admin/cacheman/_metrics splunk_server=<INDEXERS>
| rename migration.total_jobs AS migration_jobs_total, migration.current_job AS migration_jobs_complete
| eval migration_jobs_remaining=migration_jobs_total-migration_jobs_complete
| fillnull value="-" migration.end_epoch
| stats count by splunk_server migration.start_epoch migration.end_epoch migration.status migration_jobs_total migration_jobs_complete migration_jobs_remaining
| eval percent_complete = round((migration_jobs_complete/migration_jobs_total)*100,1)
| eval current_time_secs=now()
| eval time_elapsed_secs=if('migration.status'="finished",('migration.end_epoch'- 'migration.start_epoch'),(current_time_secs - 'migration.start_epoch'))
| eval secs_per_job=time_elapsed_secs/migration_jobs_complete
| eval time_remaining_secs=migration_jobs_remaining*secs_per_job
| eval seconds_per_job=round((secs_per_job),2)
| convert timeformat="%+" ctime(migration.start_epoch) AS migration_start_time
| convert timeformat="%+" ctime(migration.end_epoch) AS migration_end_time
| eval migration_end_time=if('migration.status'="finished",migration_end_time,"-")
| convert timeformat="%+" ctime(current_time_secs) AS current_time
| eval current_time=if('migration.status'="finished","-",current_time)
| eval time_elapsed_hours=round(time_elapsed_secs/3600,2)
| eval time_remaining_hours=round((time_remaining_secs/3600),2)
| table splunk_server migration.status migration_start_time migration_end_time current_time migration_jobs_total migration_jobs_complete migration_jobs_remaining percent_complete time_elapsed_hours time_remaining_hours seconds_per_job
The above search can sometimes be misleading; for example, if an indexer crashes or is shut down during migration, it may still show the status as finished at 100%. A cross-check is sketched below.
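Because of that limitation, the result is worth cross-checking against the cacheman endpoint itself (the same check as in item 7 below): a zero count of unstable buckets per indexer is a stronger signal that migration actually finished. A hedged sketch using only the endpoint and field from item 7:
SPL:
| rest /services/admin/cacheman splunk_server=<INDEXERS> | search cm:bucket.stable=0 | stats count AS unstable_buckets by splunk_server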
4) Monitoring the upload operations:
SPL
index=_internal source=*metrics.log TERM(group=cachemgr_upload) | timechart span=1s sum(queued) AS queued, sum(succeeded) AS succeeded
Sample Event:
10-25-2018 10:48:06.599 +0000 INFO Metrics - group=cachemgr_upload, elapsed_ms=17017, kb=124372, succeeded=1
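To watch upload volume rather than job counts, a hedged variation on the same metrics data (the kb field appears in the sample event above):
SPL:
index=_internal source=*metrics.log TERM(group=cachemgr_upload) | timechart span=1m sum(kb) AS kb_uploaded by host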
5) Upload speed:
SPL:
index=_audit action=local_bucket_upload sourcetype=audittrail | eval elapsed_s=elapsed_ms/1000 | eval kbps=kb/elapsed_s | eval mbps=kbps/1024 | timechart span=1s max(mbps) by host
Sample Event:
Audit:[timestamp=10-25-2018 10:47:37.615, user=n/a, action=local_bucket_upload, info=completed, cache_id="bid|_internal~40~C3912E39-C49C-4A24-B119-AA4B13C0F3F1|", local_dir="/home/splunker/splunk/var/lib/splunk/_internaldb/db/db_1540464387_1540461589_40_C3912E39-C49C-4A24-B119-AA4B13C0F3F1", kb=124372, elapsed_ms=17017][n/a]
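If you want the speed broken out per index rather than per host, the index name can be pulled out of the cache_id shown in the audit event above. A sketch; the rex pattern assumes the bid|<index>~... layout seen in the sample:
SPL:
index=_audit action=local_bucket_upload sourcetype=audittrail | rex field=cache_id "bid\|(?<idx>[^~]+)~" | eval mbps=(kb/1024)/(elapsed_ms/1000) | stats avg(mbps) AS avg_mbps count AS uploads by idx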
6) Role of the .buckets_synced_to_remote_storage file in migration:
find . -type f -name .buckets_synced_to_remote_storage
./var/lib/splunk/audit/db/.buckets_synced_to_remote_storage
./var/lib/splunk/_internaldb/db/.buckets_synced_to_remote_storage
./var/lib/splunk/_introspection/db/.buckets_synced_to_remote_storage
./var/lib/splunk/_telemetry/db/.buckets_synced_to_remote_storage
./var/lib/splunk/fishbucket/db/.buckets_synced_to_remote_storage
./var/lib/splunk/historydb/db/.buckets_synced_to_remote_storage
./var/lib/splunk/defaultdb/db/.buckets_synced_to_remote_storage
./var/lib/splunk/summarydb/db/.buckets_synced_to_remote_storage
At start-up, if an index is S2 (SmartStore)-enabled, splunkd checks whether its buckets need to be uploaded by looking for the file $homePath/.buckets_synced_to_remote_storage. The presence of this file indicates that the buckets do not need to be uploaded to remote storage, and therefore no migration needs to happen for that index.
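As a quick check for indexes that have not yet completed migration, you can look for db directories that are missing the marker file. A shell sketch, run from $SPLUNK_HOME; it assumes the default homePath layout shown in the find output above:
for d in ./var/lib/splunk/*/db; do [ -f "$d/.buckets_synced_to_remote_storage" ] || echo "migration pending: $d"; done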
7) Here is another search, run from the indexer's CLI, to confirm that the migration has completed:
./splunk search "|rest /services/admin/cacheman |search cm:bucket.stable=0 |stats count" # should return zero