We need to replace one of the local hard disks in a Splunk indexer that is part of our multi-site (2-site) indexer cluster. We want to do this without kicking off any bucket fixup activity, because we plan to restore the replaced disk from backups rather than let 5TB of data replicate across the cluster unnecessarily.
We plan to have the peer down for an extended period of time while the restore job runs on the new local disk. While this node is down, I know the cluster will remain in a valid state because of the "1:1" replication of searchable/replicated buckets we have between both sites.
My question is: if we're going to take a peer node down for an extended amount of time and we DON'T want bucket fixup activity to occur, should we enable cluster maintenance mode, or should we use the 'offline' command on the downed peer and increase the cluster master's restart_timeout setting to enough hours to cover the maintenance window?
The only thing the Splunk docs really state is that "When you take a peer offline temporarily, it is usually to perform an upgrade or other maintenance for a short period of time". The maintenance mode docs, by contrast, specifically state that bucket fixup activities are mostly halted for the duration of maintenance mode, though they do not give any guidance as to how long you can safely stay in maintenance mode and still have a valid cluster.
Can you enable maintenance mode for an extended period of time and still have a valid cluster if you're normally forwarding data to both site1 and site2 indexers simultaneously?
Use maintenance mode for extended periods (such as this one).
Either one should work. However, note that maintenance mode does not currently persist across restarts of the cluster master. Also, maintenance mode stops fixup activity related to all peers, while the offline command stops fixup only for the peer being offlined.
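If you go the offline route with a longer restart_timeout, the value has to cover the whole restore window. Here is a minimal sketch of that arithmetic, assuming restart_timeout is expressed in seconds and lives in the [clustering] stanza of server.conf on the master (check the server.conf spec for your version). The function name and the 8-hour window are hypothetical, just for illustration.

```python
def restart_timeout_stanza(window_hours: float) -> str:
    """Render a server.conf fragment covering a maintenance window.

    Assumption: restart_timeout is given in seconds under [clustering].
    """
    seconds = int(window_hours * 3600)  # hours -> seconds
    return f"[clustering]\nrestart_timeout = {seconds}\n"


# Hypothetical 8-hour restore window
print(restart_timeout_stanza(8))
```

Pad the window generously, since a restore that overruns the timeout would let fixup start for that peer.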
In terms of maintaining a valid cluster during long periods of maintenance, the ability to do so depends on the stability of your cluster as well as the search factor. For instance, if you have a search factor of 2 and another peer goes down during the maintenance period, you might temporarily lose searchability for some buckets.
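To make the search factor point concrete, here is a toy model (not how Splunk tracks buckets internally). It assumes a hypothetical layout in which each bucket has two searchable copies, and reports which buckets have no searchable copy left on a running peer. The peer and bucket names are made up.

```python
from typing import Dict, List, Set


def unsearchable_buckets(
    searchable_copies: Dict[str, Set[str]], down_peers: Set[str]
) -> List[str]:
    """Return buckets whose every searchable copy sits on a down peer."""
    return sorted(
        bucket
        for bucket, peers in searchable_copies.items()
        if peers <= down_peers  # all copies are on peers that are down
    )


# Hypothetical layout with search factor 2 (two searchable copies per bucket)
copies = {
    "bucket_a": {"idx1", "idx3"},
    "bucket_b": {"idx1", "idx4"},
    "bucket_c": {"idx2", "idx3"},
}

# Only the peer under maintenance is down: everything stays searchable.
print(unsearchable_buckets(copies, {"idx1"}))          # []

# A second peer fails while fixup is halted: bucket_a loses searchability.
print(unsearchable_buckets(copies, {"idx1", "idx3"}))  # ['bucket_a']
```

With fixup halted, nothing creates a replacement searchable copy in the meantime, which is why a second failure matters more the longer the window lasts.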
I'm not sure what you mean regarding your statement about forwarders. If you're using load-balancing as recommended, then the forwarder will just skip over the downed peer.