We are seeing slow data rebalance on Splunk version 6.6.6.
Our environment is an 18-indexer multisite cluster, and data rebalance progresses at about 0.05% per day. There are about 15-25k buckets per indexer, RF and SF are met, there are no fixup tasks, hot and cold storage are on solid state, and the network is a 10 Gb fiber connection. We also cleared all excess buckets. Any thoughts?
What could be blocking the progress?
Additional Points to consider:
Data rebalancing attempts to achieve an optimal balance, not a perfect balance.
Rebalancing works per index, based on each index's bucket count rather than the total bucket count.
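To see why the per-index view matters, here is a minimal sketch that reports the bucket spread for each index across peers. The input format (`peer index bucket_count`, one line per pair) and the function name `index_spread` are assumptions for illustration, not something Splunk produces directly; you would need to build that list from your own bucket inventory.

```shell
#!/bin/sh
# Sketch: per-index bucket spread across peers.
# Input on stdin (hypothetical format): "peer index bucket_count", one line per pair.
index_spread() {
    awk '
        {
            idx = $2; n = $3 + 0
            # First time we see an index, seed its min and max.
            if (!(idx in seen)) { seen[idx] = 1; min[idx] = n; max[idx] = n }
            if (n < min[idx]) min[idx] = n
            if (n > max[idx]) max[idx] = n
        }
        END {
            for (idx in seen)
                printf "%s min=%d max=%d spread=%d\n", idx, min[idx], max[idx], max[idx] - min[idx]
        }
    ' | sort
}
```

For example, two peers holding 200 buckets each look balanced in total, but if one holds 150 `main` and 50 `web` buckets and the other holds the reverse, each index still has a spread of 100 buckets for rebalance to work through.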
1) Check for excess buckets and remove them. Rebalance does not start until excess-bucket removal has finished, which prevents unnecessary rebalancing.
index=_internal sourcetype=metrics group=subtask_counts name=cmmaster_service to_fix_excess
2) Check whether CMRepJob, CMChangeBucketjob_build, or CMTruncJob jobs are running. If there are no excess buckets and no jobs related to data rebalance are running, then something is blocking the rebalance process.
index=_internal source=splunkd.log host=CM CMRepJob running job | timechart count by job
3) Next check this endpoint to see if any jobs are stuck here. If you see a bucket stuck here with the initial reason as "rebalance cluster buckets" and the latest reason as "bucket has a pending discard peer..." then you will need to resync the buckets against the discarding peer.
curl -k -u ADMIN "https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=remove_excess"
So now find the buckets under pending discard and look for the peer that lists state=PendingDiscard.
Now resync the buckets against the discarding peers.
curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/control/control/resync_bucket_from_peer -d bucket_id=customer_metrics~260~0496D8DF-7666-48F1-8E98-2F5355493040 -d peer=C449637C-3731-4498-BA9F-DEB0A05B347B
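If many buckets are stuck, the resync call above can be scripted. This is a minimal sketch, assuming you have collected `bucket_id peer_guid` pairs (one per line) from the fixup endpoint yourself. `CM_URL`, `SPLUNK_USER`, and the function names `resync_cmd` and `resync_all` are placeholders for illustration, not Splunk-provided tools. It only prints the curl commands, so you can review them before running anything.

```shell
#!/bin/sh
# Sketch: build resync_bucket_from_peer calls for many buckets.
# CM_URL and SPLUNK_USER are placeholders; adjust for your environment.
CM_URL="${CM_URL:-https://CLUSTER_MASTER:8089}"
SPLUNK_USER="${SPLUNK_USER:-ADMIN}"

# Print the curl command for one bucket/peer pair (same endpoint as above).
resync_cmd() {
    printf 'curl -k -u %s %s/services/cluster/master/control/control/resync_bucket_from_peer -d bucket_id=%s -d peer=%s\n' \
        "$SPLUNK_USER" "$CM_URL" "$1" "$2"
}

# Read "bucket_id peer_guid" pairs from stdin and print one command per pair.
# Blank lines and lines starting with # are skipped.
resync_all() {
    while read -r bucket peer; do
        case "$bucket" in ''|'#'*) continue ;; esac
        resync_cmd "$bucket" "$peer"
    done
}
```

Run it as `resync_all < pending_discard.txt` (a hypothetical file name), check the printed commands against the fixup output, and then run them one at a time.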
4) If no buckets are stuck in pending discard, check the "latest reason" at the same fixup endpoint for clues.
5) Also check for stale replications that could be blocking current replications.
curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/replications
6) If all else fails, try to trigger rebalance again.
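One way to re-trigger it is the `splunk rebalance cluster-data` CLI command on the cluster master, which accepts `-action start`, `-action stop`, and `-action status`. The wrapper below is only a sketch: `restart_rebalance` is a hypothetical helper name, and it assumes the `splunk` binary is on your PATH unless you pass its path as the first argument.

```shell
#!/bin/sh
# Sketch: stop any stalled rebalance, start a fresh one, then report status.
# The first argument is the splunk binary (default: "splunk" on PATH).
# Run this on the cluster master.
restart_rebalance() {
    splunk_bin="${1:-splunk}"
    "$splunk_bin" rebalance cluster-data -action stop
    "$splunk_bin" rebalance cluster-data -action start
    "$splunk_bin" rebalance cluster-data -action status
}
```

Calling `restart_rebalance echo` prints the three commands without running them, which is a quick way to check what would be executed.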