My two cents:
Without 6.5's magical rebalance cluster unicorn command:
1. Add the new indexers.
2. Leverage indexer discovery and weighted load balancing to drive all traffic to the new indexers.
3. Change your RF and SF, as this affects only new data coming in, not the past data.
4. Take down one old indexer (./splunk offline --enforce-counts).
5. Wait for the buckets to be redistributed between the old and new indexers. (This might take time, since only one copy is searchable.)
6. Repeat steps 4-5 for the rest of the old indexers.
7. Replace the drives in the old servers.
8. Change the weighted load balancing factor from step 2 to send data across all indexers (or, if you are picky, reverse the distribution factor from step 2 for some time so that the older indexers catch up with the new indexers).
9. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
10. Profit!
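For what it's worth, the take-one-down-and-wait loop in steps 4-6 can be sketched as a dry-run script. This is only a sketch: the host names, the /opt/splunk path, and the use of ssh are my assumptions, and it prints the commands rather than running them, so you decide when each peer is actually safe to take down.

```shell
#!/bin/sh
# Dry-run sketch: prints the rolling-offline commands instead of executing them.
# Host names and install path below are hypothetical; substitute your own.
OLD_INDEXERS="old-idx1 old-idx2 old-idx3"
SPLUNK_HOME="/opt/splunk"

plan_rolling_offline() {
    for host in $OLD_INDEXERS; do
        # Step 4: take one old peer down and let the cluster fix up bucket counts.
        echo "ssh $host $SPLUNK_HOME/bin/splunk offline --enforce-counts"
        # Step 5: on the cluster master, check the cluster state before
        # moving on to the next peer.
        echo "$SPLUNK_HOME/bin/splunk show cluster-status"
    done
}

plan_rolling_offline
```

Run the printed commands one at a time, and don't move on to the next old indexer until the cluster status shows the replication and search factors are met again.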
With 6.5:
1. Add the new indexers.
2. Leverage indexer discovery and weighted load balancing to drive all traffic to the new indexers.
3. Change your RF and SF, as this affects only new data coming in, not the past data.
4. Take down one old indexer (./splunk offline --enforce-counts).
5. Wait for the buckets to be redistributed between the old and new indexers. (This might take time, since only one copy is searchable.)
6. Repeat steps 4-5 for the rest of the old indexers.
7. Replace the drives in the old servers.
8. Rebalance.
9. Change the weighted load balancing factor from step 2 to uniformly send data across all indexers.
10. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
11. Profit!
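The rebalance step on 6.5 would look roughly like the sketch below, run on the cluster master. The SPLUNK variable is my own convenience, not anything Splunk defines: it defaults to a dry run that only prints each command, and you point it at your real splunk binary when you are ready.

```shell
#!/bin/sh
# SPLUNK is a hypothetical helper variable. By default it only echoes the
# command (dry run); set it to e.g. /opt/splunk/bin/splunk to run for real.
SPLUNK="${SPLUNK:-echo splunk}"

rebalance_cluster() {
    # Kick off data rebalancing across all peers (runs in the background).
    $SPLUNK rebalance cluster-data -action start
    # Check how far the rebalance has progressed.
    $SPLUNK rebalance cluster-data -action status
}

rebalance_cluster
```

The start action returns right away, so repeat the status call until the rebalance finishes before changing the load balancing in the next step.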
The only difference between the two approaches is that with 6.5 you have the flexibility to rebalance AFTER you add disks to the old servers. Either way, you still need the hacky take-one-indexer-down-at-a-time approach to ensure your old data stays searchable at all times during the upgrade.
You could move step 3 around, since it only affects new data. Also, there may be a bug in the splunk offline command. If you run into it, you could just use ./splunk stop instead; after a timeout interval, the cluster should kick off the same bucket remediation activities.