Taking one site of a two-site indexer cluster down for extended maintenance

gacorey1
Explorer

Hello,

We have a two-site indexer cluster and need to take one site down for around 12 hours for maintenance happening in the data center where that site resides.  We have the following settings in place for site_replication_factor and site_search_factor:

site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2

What would be the best way to proceed with taking one site down? Should we put CM in maintenance mode for the entirety of the maintenance period?

Thanks.


thahir
Contributor

@gacorey1

Enable maintenance mode on the CM:

$SPLUNK_HOME/bin/splunk enable maintenance-mode

Take the indexer peers offline one at a time:

$SPLUNK_HOME/bin/splunk offline

This command ensures the CM properly reassigns primaries to peers at the other site before shutting down the indexer.
Note: The offline process duration depends on the number and size of your buckets. It can take several minutes to complete.

Keep monitoring the bucket status from the Cluster Manager. Once all the buckets have been distributed to the other peers, you can proceed with the maintenance. Make sure there is enough disk space at site 1, because the buckets are being distributed to the peers there and need physical space on them.
If you don't have enough space, split the work into two maintenance windows.

Once the maintenance is complete, go to the Cluster Manager and check the bucket status; a lot of buckets will have piled up waiting to sync. Remove maintenance mode from the cluster, start the Splunk services on the indexer peers, and observe the bucket status:

$SPLUNK_HOME/bin/splunk disable maintenance-mode

As the cluster returns to a healthy, normal state, the built-in fixup process redistributes the missing copies and restores search primaries automatically.
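
For anyone who wants the sequence above in one place, here is a rough dry-run sketch. It only prints the commands in the order described in this answer and does not run anything against Splunk. The host names and the /opt/splunk fallback for SPLUNK_HOME are assumptions for illustration; `show cluster-status` and `start` are ordinary Splunk CLI commands added here as the way to observe the bucket status and bring the peers back.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps from the answer instead of executing them.
# Host names passed as arguments are hypothetical; /opt/splunk is an assumed default.

SPLUNK_BIN="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

print_outage_plan() {
    # Arguments: the indexer peers at the site that is going down.
    echo "[cluster manager] $SPLUNK_BIN enable maintenance-mode"
    for peer in "$@"; do
        # One peer at a time, so primaries can move to the other site.
        echo "[$peer] $SPLUNK_BIN offline"
    done
    echo "[cluster manager] $SPLUNK_BIN show cluster-status"
    echo "--- data center maintenance happens here ---"
    echo "[cluster manager] $SPLUNK_BIN disable maintenance-mode"
    for peer in "$@"; do
        echo "[$peer] $SPLUNK_BIN start"
    done
    echo "[cluster manager] $SPLUNK_BIN show cluster-status"
}

# Example with two hypothetical peer names.
print_outage_plan idx-site2-01 idx-site2-02
```

Printing the plan first makes it easy to review the order with the team before the maintenance window, and the same list can serve as a checklist during the work.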


gacorey1
Explorer

Just reporting back on this for others. This actually went really well for us. For more context, we have 10 indexers on each site with around 45K buckets on each indexer. We are also on version 9.4.1. We had one site down for about 9 hours while our data center performed maintenance. Before the maintenance, we put CM in MM, offlined the indexers one at a time (with bin/splunk offline) waiting for a "Restarting" status before proceeding to the next indexer. Once the data center maintenance was complete, we started Splunk up on all indexers at once, waited for an "Up" status for all indexers in the CM UI, and then took CM out of MM. The fixup time was less than 30 minutes, which is better than we expected. We have noticed that our current version of 9.4.1 seems to be more efficient with fixup than some of our previous versions.
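
The "wait for a status before moving on" part of this could be scripted with a small polling helper. The sketch below is generic and makes no assumption about how the status is read: `wait_until` simply re-runs whatever check command it is given. The `peer_has_status` name in the usage comment is hypothetical; you would have to write that check yourself, for example from what the CM reports for each peer.

```shell
#!/bin/sh
# Generic polling helper: re-runs a check command until it succeeds
# or the allowed number of attempts is used up.

wait_until() {
    # Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
    attempts="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0            # check passed, safe to move on
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"   # wait before the next attempt
        fi
        i=$((i + 1))
    done
    return 1                    # gave up; investigate before continuing
}

# Hypothetical usage (peer_has_status is a check you would supply):
#   wait_until 60 30 peer_has_status idx-site2-01 Restarting
```

A non-zero return means the check never passed within the limit, which is a cue to stop and look at the CM rather than offline the next indexer.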
