Getting Data In

What's the best method to update/replace indexer cluster members?

twinspop
Influencer

We will be getting another batch of indexers in shortly, and each will have substantially more drive space than the old: 20 x 1.6 TB SSD vs 20 x 600 GB spinny. We will have more new servers than old, and we have approval to swap the drives in the old servers for new ones. (The h/w is otherwise the same: 20-core Xeon, 128 GB RAM, HP DL380 G9.) Christmas is coming early. Woot woot.

Sketchy plan follows:

  1. Increase RF and SF from RF=2/SF=1 to RF=3/SF=2 (see the config sketch below)
  2. Wait for this change to settle
  3. Shut down an 'old' server
  4. Slot in a new server to replace the old one
  5. Wait for cluster magic; RF and SF met
  6. Optionally use 6.5.0's new goodness to rebalance data. Maybe?
  7. Repeat steps 3-6 until all the old servers are out of service and the new ones are in place
  8. Replace drives in old servers
  9. Add all old servers to the cluster as new members
  10. Rebalance data one last time
  11. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.

(repeat entire process for our other cluster)
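
For step 1, a minimal sketch of the factor change on the cluster master (standard 6.x cluster-config CLI; whether a restart is needed can vary by version):

    # On the cluster master: raise the factors
    ./splunk edit cluster-config -replication_factor 3 -search_factor 2
    ./splunk restart    # restart may be needed for the change to take effect

    # Equivalent server.conf settings on the master:
    # [clustering]
    # mode = master
    # replication_factor = 3
    # search_factor = 2

    # Watch the cluster work toward the new factors before swapping anything
    ./splunk show cluster-status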

1 Solution

sk314
Builder

My two cents:

Without 6.5's magical rebalance cluster unicorn command:

  1. Add the new indexers
  2. Leverage indexer discovery and weighted load balancing to drive all traffic to the new indexers (see the config sketch after this list).
  3. Change your RF and SF; this affects only new data coming in, not past data.
  4. Take down one old indexer (./splunk offline --enforce-counts)
  5. Wait for the buckets to be redistributed between the old and new indexers (this might take time since only one copy is searchable)
  6. Repeat steps 4-5 for the rest of the old indexers
  7. Replace drives in old servers
  8. Change the weighted load balancing from step 2 to send data across all indexers (or, if you are picky, reverse the distribution from step 2 for a while so the older indexers catch up with the new ones)
  9. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
  10. Profit!
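
If you go the indexer discovery route in step 2, a rough sketch of the forwarder and master settings involved (stanza and setting names per the 6.x outputs.conf/server.conf specs; hostnames and keys are placeholders, and the disk-capacity weighting flag is worth double-checking against your version's docs):

    # outputs.conf on each forwarder
    [indexer_discovery:cluster1]
    master_uri = https://cluster-master.example.com:8089
    pass4SymmKey = <discovery_secret>

    [tcpout:cluster1_peers]
    indexerDiscovery = cluster1

    [tcpout]
    defaultGroup = cluster1_peers

    # server.conf on the cluster master
    [indexer_discovery]
    pass4SymmKey = <discovery_secret>
    # weight forwarder traffic toward peers with more total disk capacity,
    # i.e. the new SSD-heavy indexers
    indexerWeightByDiskCapacity = true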

With 6.5

  1. Add the new indexers
  2. Leverage indexer discovery and weighted load balancing to drive all traffic to new indexers.
  3. Change your RF and SF; this affects only new data coming in, not past data.
  4. Take down one old indexer (./splunk offline --enforce-counts)
  5. Wait for the buckets to be redistributed between the old and new indexers (this might take time since only one copy is searchable)
  6. Repeat steps 4-5 for the rest of the old indexers
  7. Replace drives in old servers
  8. Rebalance (see the command sketch after this list)
  9. Change the weighted load balancing factor from step 2 to send data uniformly across all indexers
  10. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
  11. Profit!
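
For step 8, a sketch of the 6.5 data rebalance CLI, run on the cluster master (action names per the 6.5 docs):

    # Kick off data rebalance across all peers
    ./splunk rebalance cluster-data -action start

    # Check progress, or stop it if it gets in the way
    ./splunk rebalance cluster-data -action status
    ./splunk rebalance cluster-data -action stop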

The only difference between the two approaches is that with 6.5 you have the flexibility to rebalance AFTER you add disks to the old servers. You still need to rebalance using the hacky take-one-indexer-down-at-a-time approach to ensure your old data is searchable at all times during the upgrade.

You could move step 3 around since it only affects new data. Also, there may be a bug in the splunk offline command; in that case, you could just replace it with ./splunk stop. After a timeout interval, the cluster master should kick off the same bucket-fixup activities (sketched below).
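
A sketch of the retire-one-peer loop from steps 4-6 (commands as referenced above; the offline runs on each old peer, the status check on the master):

    # On the old peer being retired: block until enough bucket copies exist elsewhere
    ./splunk offline --enforce-counts

    # Fallback if offline misbehaves: a plain stop; once the master's restart
    # timeout expires, it starts the same bucket-fixup activity on its own
    ./splunk stop

    # On the cluster master: confirm RF/SF are met again before touching the next peer
    ./splunk show cluster-status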


twinspop
Influencer

Updating this issue: we've completed the indexer swap, and it went fairly well. We went with adding all the new servers, then offlining one "old" server at a time. We started with 5 old servers and added 7 new ones, then dropped the 5 old, replaced their drives, and added them back into the cluster. Then we rebalanced. No fatal problems, but the rebalance command doesn't always run to completion: sometimes it progresses nicely, other times it showed 0.7% complete after 72 hours.
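
For reference, checking on or restarting a stalled rebalance from the cluster master looks roughly like this (the -max_runtime parameter is from the docs as best I recall; verify it on your version):

    # Check whether the rebalance is still making progress
    ./splunk rebalance cluster-data -action status

    # If it looks stuck, stop it and start again, optionally with a time limit (minutes)
    ./splunk rebalance cluster-data -action stop
    ./splunk rebalance cluster-data -action start -max_runtime 600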

lguinn2
Legend

@twinspop - Correct. Changing the factors does make the cluster take action to become valid/complete with the new factors. This can cause a lot of recovery activity in an existing cluster.


twinspop
Influencer

Thank you, sir. I don't use indexer discovery, but I don't think step 2 is vital. Correct? And step 3... are you saying that changing RF/SF has no effect on already-stored data? If that's accurate, it's a surprise, but good to know!


sk314
Builder

You are right, it's not vital. You can control the rate at which different sets of indexers fill up over time to get some sort of eventually-balanced disk usage. As far as the search factor is concerned, I remember reading something like that in the docs; however, I am not able to find the reference now. I will post it here if I find it.


twinspop
Influencer

I built out a cluster in a test/lab scenario. Changing the RF and/or SF on the CM forces the change on all buckets, not just new ones, as far as I can tell. 🙂 The process above worked as planned.
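
One way to see this in a test cluster (a sketch; assumes a 6.x cluster master, and the restart may or may not be required for the change to apply):

    # On the test cluster master: observe fixup activity on existing buckets
    ./splunk show cluster-status      # before: SF=1 met, cluster happy
    ./splunk edit cluster-config -search_factor 2
    ./splunk restart
    ./splunk show cluster-status      # after: fixups run on old buckets until SF=2 is met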
