We recently deployed 5 new indexers into site 2 of our 2-site clustered environment to replace 5 old ones in the same site. We have offlined the old indexers and I am now attempting to rebalance the cluster.
I will note that a large amount of bucket-fixing activity is currently taking place, as the new indexers in site 2 copy buckets from site 1 to re-establish data redundancy.
The problem: when I run a rebalance from the cluster master GUI, it begins successfully. Anywhere from a couple of minutes to an hour goes by while the completion % slowly climbs. This is demonstrated in splunkd.log:
06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance started, initial_work=900897
06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance completion percent=0.00
06-23-2022 10:20:02.534 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 10:20:32.893 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 09:51:49.099 -0400 INFO CMMaster - data rebalance completion percent=3.05
06-23-2022 09:52:21.558 -0400 INFO CMMaster - data rebalance completion percent=3.06
Then, seemingly at random, I get this error message in the logs, and the rebalance suddenly stops.
06-23-2022 10:04:58.657 -0400 INFO FixupStrategy - rebalance skipped all buckets, forcing a stop
06-23-2022 10:04:59.189 -0400 INFO CMMaster - data rebalance complete! percent=100.00
Searching the internet did not yield any results for this message. Does anyone know what could be causing my rebalance to skip all buckets?
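For reference, the same progress can also be followed outside the GUI from the cluster master. A minimal sketch - the status action and the component field are my reading of the docs, so verify against your version:

# Rebalance progress via the cluster master CLI
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action status

# The same CMMaster progress lines via a search
index=_internal sourcetype=splunkd component=CMMaster "data rebalance"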
If you are using SmartStore, be sure to avoid ticking the Searchable option when running a Data Rebalance.
"Do not use searchable data rebalance with SmartStore indexes. Searchable mode is not optimized for SmartStore and can cause data rebalance to proceed slowly. Use non-searchable data rebalance instead.
In any case, non-searchable data rebalance of SmartStore indexes usually causes only minimal search disruption. The data rebalance process runs quickly on SmartStore indexes, because it moves only bucket metadata, not the bucket data itself."
https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster
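For completeness, this is roughly what a non-searchable rebalance looks like when started from the cluster master CLI instead of the GUI. A minimal sketch - non-searchable is the default, and the -max_runtime value (in minutes) is only an illustration, so check the docs page above for your version:

# Run on the cluster master (manager) node; non-searchable is the default
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action start -max_runtime 60

# Stop it early if needed
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action stop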
Thanks Paul - I did end up running the rebalance again a couple of weekends back with "searchable=no". Some of our indexes are SmartStore, so maybe this was part of it. It worked fine this time. My best guess is that there was some conflict between the automatic bucket fixing from adding new indexers to the cluster and simultaneously trying to rebalance.
I do wish that warnings like this were displayed in the GUI. It would make life easier.
Hi
I haven't seen that message when doing a rebalance. Can you check/do the following:
r. Ismo
Hi Isoutamo,
Thanks for the response! Let me answer the questions I can based on my last attempt to rebalance the cluster over the weekend. I won't be able to run another attempt until this weekend, but here's what I've got:
1. Space is definitely sufficient on all nodes (we upgraded to machines with double the storage capacity)
2. IDX cluster status was NOT OK at the time of the rebalance attempt. It was rebalancing primaries (edit: I mean automatic bucket fixing was occurring due to adding new indexers) at the same time as my rebalance attempt. On reflection, I feel this was most likely the problem - rebalancing the indexes while bucket fixing was still running does not make much sense on my part (see the status-check sketch after this list).
3. This was done
4. As stated above, these operations were occurring simultaneously
5. Can test upcoming weekend
6. This was left default (blank)
7. Unsure - if this corresponds to the "Searchable" GUI option, it was NOT selected.
8. I tried multiple thresholds (0.9, 0.85, 0.8) with the same result
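For reference, the cluster status mentioned in item 2 can be checked from the cluster master before another attempt. A minimal sketch - the verbose output details vary by version:

# Overall cluster health, including replication factor and search factor status
$SPLUNK_HOME/bin/splunk show cluster-status --verbose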
7. It's just that.
https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster
And if possible, it's probably a good idea to do a rolling restart before starting the rebalance. At least some REST endpoints have given wrong values when there have been disk issues since the last restart. I'm not sure if this is still valid with versions newer than 7.3.3.
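For completeness, the rolling restart is issued from the cluster master and restarts the peers in batches so the cluster stays available. A minimal sketch, assuming default restart settings on the master:

# Run on the cluster master; peers restart in batches per the master's restart settings
$SPLUNK_HOME/bin/splunk rolling-restart cluster-peers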
Thank you, sir - since my original message, all automatic bucket fixing has finished, so the cluster is now in a stable state. I will take your recommendation to perform a rolling restart this weekend and then attempt another rebalance, this time with searchable=yes.
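In case it helps anyone verifying the same thing, pending fixup tasks can also be inspected over the management port before kicking off the rebalance. A rough sketch - the fixup endpoint and its level parameter are my reading of the REST API reference, and the credentials/host are placeholders, so verify against your version:

# List pending fixup tasks on the cluster master (placeholder credentials and host)
# The 'level' value (e.g. replication_factor, search_factor) is an assumption from the REST reference
curl -k -u admin:changeme "https://localhost:8089/services/cluster/master/fixup?level=replication_factor"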
Thanks for this. To provide a delayed update: I reran the rebalance a couple of weeks back and it worked fine this time... I did run with "searchable=no" this time. Perhaps there was some conflict between the rebalance and the automatic bucket fixing that was taking place, since I had just joined multiple indexers to our cluster.