Hello,
We have a two-site indexer cluster and need to take one site down for around 12 hours for maintenance happening in the data center where that site resides. We have the following settings in place for site_replication_factor and site_search_factor:
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
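For reference, these are set on the cluster manager in server.conf; ours looks roughly like this (stanza layout from memory, site names as above):

[general]
site = site1

[clustering]
mode = manager
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2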
What would be the best way to proceed with taking one site down? Should we put the CM in maintenance mode for the entirety of the maintenance period?
Thanks.
Enable maintenance mode on the CM:
$SPLUNK_HOME/bin/splunk enable maintenance-mode
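Before touching any peers, it's worth confirming that maintenance mode actually took effect. Recent versions have a matching show command on the CM:

$SPLUNK_HOME/bin/splunk show maintenance-mode

It should report a value of 1 (enabled).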
Take the indexer peers offline gradually, one at a time:
$SPLUNK_HOME/bin/splunk offline
This command ensures the CM properly reassigns primaries to peers at the other site before shutting down the indexer.
Note: The offline process duration depends on the number and size of your buckets. It can take several minutes to complete.
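Worth noting: plain splunk offline performs a fast offline, reassigning primaries but deferring full bucket fixup. If you instead need the CM to complete all fixup before the peer shuts down, there is a stricter variant, which can take far longer:

$SPLUNK_HOME/bin/splunk offline --enforce-counts

For a planned window with maintenance mode enabled, the fast version is normally the right choice.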
Keep monitoring the bucket status from the Cluster Manager. Once all the buckets have been redistributed to the peers on the other site, you can proceed with the activity. Also make sure the site that stays up has enough disk space, because the CM is redistributing the buckets to those peers, so you need physical space on them (a CLI way to watch this is sketched after this step).
If you don't have enough space, split the work into two maintenance windows.
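Here's a quick way to watch the peer and bucket status from the CM host instead of the UI (the 30-second watch interval is just an example):

watch -n 30 "$SPLUNK_HOME/bin/splunk show cluster-status --verbose"

splunk show cluster-status lists each peer's state and whether all data is searchable; --verbose adds replication/search factor detail on recent versions.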
Once the maintenance activity is complete, go to the Cluster Manager and check the buckets; there will likely be a lot of buckets piling up for sync. Disable maintenance mode on the cluster, start the Splunk services on the indexer peers, and observe the bucket status:
$SPLUNK_HOME/bin/splunk disable maintenance-mode
Once the cluster is healthy and back to normal, the built-in fixup process redistributes the missing copies and restores searchable primaries automatically.
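If you prefer REST for watching the fixup backlog drain once maintenance mode is off, the CM exposes the fixup queues (credentials and hostname below are placeholders):

curl -k -u admin:changeme "https://cm.example.com:8089/services/cluster/master/fixup?level=search_factor&output_mode=json"

A shrinking list means fixup is progressing; level=replication_factor and level=generation can be queried the same way.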
Just reporting back on this for others. This actually went really well for us. For more context, we have 10 indexers on each site with around 45K buckets on each indexer. We are also on version 9.4.1. We had one site down for about 9 hours while our data center performed maintenance. Before the maintenance, we put CM in MM, offlined the indexers one at a time (with bin/splunk offline) waiting for a "Restarting" status before proceeding to the next indexer. Once the data center maintenance was complete, we started Splunk up on all indexers at once, waited for an "Up" status for all indexers in the CM UI, and then took CM out of MM. The fixup time was less than 30 minutes, which is better than we expected. We have noticed that our current version of 9.4.1 seems to be more efficient with fixup than some of our previous versions.
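In case anyone wants to script the status checks we did by eye in the CM UI, the same per-peer status field is available from the CM REST API (credentials and hostname are placeholders):

curl -sk -u admin:changeme "https://cm.example.com:8089/services/cluster/master/peers?output_mode=json" | grep -oE '"(label|status)":"[^"]*"'

Each peer reports a status such as Up, Restarting, or Down, which is what we were waiting on between offlines.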