Hello,
I'm not finding info on the limits within Splunk's data rebalancing. Some context: I have ~40 indexers and stood up 8 new ones. The 40 old ones had an avg of ~150k buckets each. At some point the rebalance reported that it was complete (above the .9 threshold) even though there were only ~40k buckets on the new indexers. When I kicked off a second rebalance, it started from 20% again and continued rebalancing, because the new indexers were NOT yet space-limited on their SmartStore caches. The timeout was set to 11 hours and the first run finished in ~4. The master did not restart during this balancing.
Can anyone shed some more light on why the first rebalance died? Like, is there a 350k bucket limit per rebalance or something?
When the cluster meets the minimum threshold (e.g., 0.9 balance), the rebalance process considers its job “done,” even if distribution across the newest indexers still isn’t as even as expected. That’s why the first rebalance stopped after ~4 hours, while the second one restarted from ~20% and continued moving more buckets. Essentially, Splunk rebalancing is designed to gradually optimize data distribution while minimizing cluster load, not necessarily to perfectly even out every run.
That's true. Also keep in mind that rebalancing only counts the number of buckets when it does its work. Because buckets can vary in size, disk space usage is not rebalanced, only the bucket counts.
There are two kinds of rebalancing:
The 1st one is done automatically in quite a few situations, e.g. after a rolling restart.
The 2nd one is always a manual operation, and its target is set to the 90% level by default.
What I have done myself is modify that percentage. Depending on the environment I have used e.g. 95-99% to get a better distribution of buckets across the nodes. After you have reached a suitable level, you should adjust the percentage back to 90%.
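For reference, the knob is the rebalance_threshold setting in server.conf on the cluster master. This is a sketch from memory; I believe the master needs a restart (or at least a config reload) for the change to take effect, so check the docs for your version:

    # server.conf on the cluster master
    [clustering]
    # default is 0.90; temporarily raise it (e.g. 0.95-0.99), rebalance, then set it back
    rebalance_threshold = 0.97

Once the buckets are spread the way you want, putting it back to 0.90 keeps routine rebalances from churning buckets unnecessarily.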
Hi @dersonje2
Splunk doesn't impose a hard 350k-buckets-per-rebalance limit. It sounds like your first rebalance simply hit the default/configured rebalance threshold, so the master declared it "good enough" and stopped. By default the master stops moving buckets once it is within 90% of the ideal distribution; with over 40 indexers already in place before the new ones were added, in theory that could mean a pretty small number of buckets ends up on the new indexers, since the cluster will have started at roughly 80% of an even distribution.
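Some rough back-of-envelope math with the numbers from your question, just to show why ~40k buckets per new indexer looks low even against a 0.9 target (this assumes the threshold is measured against a simple even spread across all 48 peers, which is my simplification rather than anything from the docs):

    total buckets       ~ 40 indexers x 150k buckets = ~6,000,000
    ideal per indexer   ~ 6,000,000 / 48 indexers    = ~125,000
    90% of ideal        ~ 0.9 x 125,000              = ~112,500

The ~40k you actually observed is well short of that, which would fit the completion check being evaluated per index (and per rebalance pass) rather than against the aggregate bucket count, so individual indexes can be declared balanced while the overall spread still looks uneven.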
If you want to make the spread more even, increase rebalance_threshold within the [clustering] stanza in server.conf on the cluster master to a number closer to 1, where 1 = 100% distribution. This should improve the distribution you are getting.
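For example, after bumping the threshold you can drive the rebalance from the cluster master CLI, roughly like this (a sketch; double-check the exact flags, and the units of -max_runtime, which I believe are minutes, against the docs for your version):

    # run on the cluster master
    splunk rebalance cluster-data -action start -max_runtime 660   # ~11 hours, if the units are minutes
    splunk rebalance cluster-data -action status                   # monitor progress
    splunk rebalance cluster-data -action stop                     # cancel early if needed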
Thanks for confirming there shouldn't be a limit. I agree the cluster master decided it was good enough, but I don't understand how it could have hit an "ideal distribution" and then, minutes later, another balancing run recognized that another ~40k+ buckets per indexer needed to be moved to those same indexers. It isn't too important, because I just restarted the balancing runs until it was actually balanced, but it makes me wonder if this is the only bucket-based operation that has gremlins.