We will be getting another batch of indexers in shortly, and each will have substantially more drive space than the old ones: 20x1.6TB SSD vs 20x600GB spinny. We will have more new servers than old, and we have approval to swap the drives in the old servers for new ones. (The hardware is otherwise the same: 20-core Xeon, 128 GB RAM, HP DL380 G9.) Christmas is coming early. Woot woot.
Sketchy plan follows:
(repeat entire process for our other cluster)
My two cents:
Without 6.5's magical rebalance cluster unicorn command:
With 6.5:
The only difference between the two approaches is that with 6.5 you have the flexibility to rebalance AFTER you add disks to the old servers. You still need to rebalance using the hacky take-one-indexer-down-at-a-time approach to ensure your old data is searchable at all times during the upgrade.
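For reference, the 6.5 rebalance is kicked off from the cluster master's CLI. A minimal sketch, assuming you're sitting in $SPLUNK_HOME/bin on the master:

    # Run on the cluster master (Splunk 6.5+): redistributes existing bucket
    # copies more evenly across all peers in the cluster
    ./splunk rebalance cluster-data -action start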
You could move step 3 around, since it only affects new data. Also, there may be a bug in the splunk offline command; in that case, you could just replace it with the ./splunk stop command. After the timeout interval expires, the same bucket-remediation activities should kick in.
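To make the offline-vs-stop point concrete, the per-indexer sequence looks roughly like this. Just a sketch of the fallback being described, run from $SPLUNK_HOME/bin on the peer being swapped:

    # Preferred: gracefully take the peer out so the master reassigns its
    # primaries before it goes away
    ./splunk offline

    # Fallback if offline misbehaves: just stop the peer. Once the master's
    # restart timeout expires, the same bucket fix-up activity kicks in anyway.
    ./splunk stop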
Updating this issue: we've completed the indexer swap, and it went fairly well. We went with adding all the new servers, then offlining one "old" server at a time. Started with 5 old, added 7 new. Dropped the 5 old, replaced their drives, and added them back into the cluster. Then we rebalanced. No fatal problems, but the rebalance command doesn't always run to completion. Sometimes it progressed nicely; other times it showed 0.7% complete after 72 hours.
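For anyone who hits the same stall, checking on (and re-kicking) the rebalance from the cluster master looks roughly like this. Treat it as a sketch; the status/stop actions are how I understand the subcommand:

    # On the cluster master: see how far the rebalance has gotten
    ./splunk rebalance cluster-data -action status

    # If it looks stuck, stop it and start it again
    ./splunk rebalance cluster-data -action stop
    ./splunk rebalance cluster-data -action start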
@twinspop - Correct. Changing the factors does make the cluster take action to become valid/complete with the new factors. This can cause a lot of recovery activity in an existing cluster.
Thank you, sir. I don't use indexer discovery, but I don't think step 2 is vital. Correct? And step 3... are you saying that changing RF/SF has no effect on already-stored data? If that's accurate, it's a surprise, but good to know!
You are right, it's not vital. It just lets you control the rate at which different sets of indexers fill up over time, to get some sort of eventually-balanced disk usage. As far as the search factor is concerned, I remember reading something like that in the docs; however, I can't find the reference now. I will post it here if I find it.
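For completeness, the knob I'm thinking of is the disk-capacity weighting in indexer discovery. A sketch of the master-side server.conf stanza, assuming that's the setting meant here (the pass4SymmKey is a placeholder):

    # server.conf on the cluster master
    [indexer_discovery]
    pass4SymmKey = <placeholder_secret>
    # Weight forwarder load balancing by each peer's total disk capacity,
    # so the bigger new indexers receive proportionally more data
    indexerWeightByDiskCapacity = true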
I built out a cluster in a test/lab scenario. Changing the RF and/or SF on the CM forces the change on all buckets, not just new ones, as far as I can tell. 🙂 The process above worked as planned.
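For anyone who wants to try the same test, changing the factors on the CM and watching the fix-up looks roughly like this. A sketch only, with example factor values and the flag names from memory:

    # On the cluster master: bump the factors (example values), then restart
    # the master so the new settings take effect
    ./splunk edit cluster-config -replication_factor 3 -search_factor 2
    ./splunk restart

    # Watch the resulting fix-up: existing buckets get replicated / made
    # searchable to meet the new factors, not just newly created ones
    ./splunk show cluster-status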