Getting Data In

What's the best method to update/replace indexer cluster members?

twinspop
Influencer

We will be getting another batch of indexers in shortly, and each will have substantially more drive space than the old: 20 x 1.6 TB SSD vs 20 x 600 GB spinny. We will have more new servers than old, and we have approval to swap the drives in the old servers for new ones. (The h/w is otherwise the same: 20-core Xeon, 128 GB RAM, HP DL380 G9.) Christmas is coming early. Woot woot.

Sketchy plan follows:

  1. Increase RF and SF from RF=2/SF=1 to RF=3/SF=2 (see the config sketch below)
  2. Wait for this change to settle
  3. Shut down an 'old' server
  4. Slot in a new server to replace the old one
  5. Wait for cluster magic; RF and SF met
  6. Optionally use 6.5.0's new goodness to rebalance data. Maybe?
  7. Repeat steps 3-6 until all the old servers are out of service and the new ones are in place
  8. Replace drives in old servers
  9. Add all old servers to the cluster as new members
  10. Rebalance data one last time
  11. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.

(repeat entire process for our other cluster)
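
For step 1, a minimal sketch of the factor change on the cluster master (standard 6.x cluster-config CLI; whether a restart is needed can vary by version):

    # On the cluster master: raise the factors
    ./splunk edit cluster-config -replication_factor 3 -search_factor 2
    ./splunk restart    # restart may be needed for the change to take effect

    # Equivalent server.conf settings on the master:
    # [clustering]
    # mode = master
    # replication_factor = 3
    # search_factor = 2

    # Watch the cluster work toward the new factors before swapping anything
    ./splunk show cluster-status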

1 Solution

sk314
Builder

My two cents:

Without 6.5's magical rebalance cluster unicorn command:

  1. Add the new indexers
  2. Leverage indexer discovery and weighted load balancing to drive all traffic to the new indexers (see the config sketch after this list).
  3. Change your RF and SF; this affects only new data coming in, not past data.
  4. Take down one old indexer (./splunk offline --enforce-counts)
  5. Wait for the buckets to be redistributed between the old and new indexers (this might take time since only one copy is searchable)
  6. Repeat steps 4-5 for the rest of the old indexers
  7. Replace drives in old servers
  8. Change the weighted load balancing from step 2 to send data across all indexers (or, if you are picky, reverse the distribution from step 2 for a while so the older indexers catch up with the new ones)
  9. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
  10. Profit!
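
If you go the indexer discovery route in step 2, a rough sketch of the forwarder and master settings involved (stanza and setting names per the 6.x outputs.conf/server.conf specs; hostnames and keys are placeholders, and the disk-capacity weighting flag is worth double-checking against your version's docs):

    # outputs.conf on each forwarder
    [indexer_discovery:cluster1]
    master_uri = https://cluster-master.example.com:8089
    pass4SymmKey = <discovery_secret>

    [tcpout:cluster1_peers]
    indexerDiscovery = cluster1

    [tcpout]
    defaultGroup = cluster1_peers

    # server.conf on the cluster master
    [indexer_discovery]
    pass4SymmKey = <discovery_secret>
    # weight forwarder traffic toward peers with more total disk capacity,
    # i.e. the new SSD-heavy indexers
    indexerWeightByDiskCapacity = true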

With 6.5

  1. Add the new indexers
  2. Leverage indexer discovery and weighted load balancing to drive all traffic to new indexers.
  3. Change your RF and SF; this affects only new data coming in, not past data.
  4. Take down one old indexer (./splunk offline --enforce-counts)
  5. Wait for the buckets to be redistributed between the old and new indexers (this might take time since only one copy is searchable)
  6. Repeat steps 4-5 for the rest of the old indexers
  7. Replace drives in old servers
  8. Rebalance (see the command sketch after this list)
  9. Change the weighted load balancing factor from step 2 to send data uniformly across all indexers
  10. Bask in the glory of a new indexer cluster with 140% more h/w and 500% more drive space. On SSDs.
  11. Profit!
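
For step 8, a sketch of the 6.5 data rebalance CLI, run on the cluster master (action names per the 6.5 docs):

    # Kick off data rebalance across all peers
    ./splunk rebalance cluster-data -action start

    # Check progress, or stop it if it gets in the way
    ./splunk rebalance cluster-data -action status
    ./splunk rebalance cluster-data -action stop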

The only difference between the two approaches is that with 6.5 you have the flexibility to rebalance AFTER you add disks to the old servers. You still need to rebalance using the hacky take-one-indexer-down-at-a-time approach to ensure your old data is searchable at all times during the upgrade.

You could move step 3 around since it only affects new data. Also, there may be a bug in the splunk offline command; in that case, you could just replace it with ./splunk stop. After a timeout interval, the cluster master should kick off the same bucket-fixup activities (sketched below).
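
A sketch of the retire-one-peer loop from steps 4-6 (commands as referenced above; the offline runs on each old peer, the status check on the master):

    # On the old peer being retired: block until enough bucket copies exist elsewhere
    ./splunk offline --enforce-counts

    # Fallback if offline misbehaves: a plain stop; once the master's restart
    # timeout expires, it starts the same bucket-fixup activity on its own
    ./splunk stop

    # On the cluster master: confirm RF/SF are met again before touching the next peer
    ./splunk show cluster-status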


twinspop
Influencer

Updating this issue: we've completed the indexer swap, and it went fairly well. We went with adding all the new servers, then offlining one "old" server at a time. We started with 5 old servers and added 7 new ones, then dropped the 5 old, replaced their drives, and added them back into the cluster. Then we rebalanced. No fatal problems, but the rebalance command doesn't always run to completion: sometimes it progresses nicely, other times it showed 0.7% complete after 72 hours.
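
For reference, checking on or restarting a stalled rebalance from the cluster master looks roughly like this (the -max_runtime parameter is from the docs as best I recall; verify it on your version):

    # Check whether the rebalance is still making progress
    ./splunk rebalance cluster-data -action status

    # If it looks stuck, stop it and start again, optionally with a time limit (minutes)
    ./splunk rebalance cluster-data -action stop
    ./splunk rebalance cluster-data -action start -max_runtime 600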

lguinn2
Legend

@twinspop - Correct. Changing the factors does make the cluster take action to become valid/complete with the new factors. This can cause a lot of recovery activity in an existing cluster.


twinspop
Influencer

Thank you, sir. I don't use indexer discovery, but I don't think step 2 is vital. Correct? And step 3... are you saying that changing RF/SF has no effect on already-stored data? If that's accurate, it's a surprise, but good to know!


sk314
Builder

You are right, it's not vital. You can control the rate at which different sets of indexers fill up over time to get some sort of eventually-balanced disk usage. As far as the search factor is concerned, I remember reading something like that in the docs; however, I am not able to find the reference now. I will post it here if I find it.


twinspop
Influencer

I built out a cluster in a test/lab scenario. Changing the RF and/or SF on the CM forces the change on all buckets, not just new ones, as far as I can tell. 🙂 The process above worked as planned.
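
One way to see this in a test cluster (a sketch; assumes a 6.x cluster master, and the restart may or may not be required for the change to apply):

    # On the test cluster master: observe fixup activity on existing buckets
    ./splunk show cluster-status      # before: SF=1 met, cluster happy
    ./splunk edit cluster-config -search_factor 2
    ./splunk restart
    ./splunk show cluster-status      # after: fixups run on old buckets until SF=2 is met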
