Deployment Architecture

data rebalance progress is very poor or getting stuck

Splunk Employee

On Splunk version 6.6.6 we are seeing very slow data rebalance.
Our 18-indexer multi-site cluster shows data rebalance progressing at about 0.05% per day. There are about 15-25k buckets per indexer, RF and SF are met, there are no fixup tasks, hot and cold storage are on solid state, and the network is a 10-gig fiber connection. We have also cleared all excess buckets. Any thoughts?
What could be blocking the progress?


Re: data rebalance progress is very poor or getting stuck

Splunk Employee

1) Check for excess buckets and remove them. Rebalance does not start until all excess buckets have been removed; this prevents unnecessary rebalancing.

https://CLUSTER_MASTER:8000/en-US/manager/system/clustering_bucket_details?tab=excess-buckets-tab
index=_internal sourcetype=metrics group=subtaskcounts name=cmmasterservice tofix_excess
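As a quick sketch outside of Splunk search (the host and credentials are placeholders, and the exact shape of the Atom feed is an assumption), you can count the entries the fixup endpoint still reports at the remove_excess level:

```shell
# count_titles: count <title> entries in an Atom feed read from stdin.
# Note: the feed usually carries one top-level <title> of its own, so
# subtract one for an exact bucket count.
count_titles() {
  grep -o "<title>" | wc -l | tr -d ' '
}

# Against a live cluster master (CLUSTER_MASTER and ADMIN are placeholders):
# curl -sk -u ADMIN "https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=remove_excess" | count_titles
```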

2) Check whether CMRepJob, CMChangeBucketjob_build, and CMTruncJob jobs are running. If there are no excess buckets left and no jobs running related to data rebalance, then something is blocking the rebalance process.

index=_internal source=splunkd.log host=CM CMRepJob running job | timechart count by job
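The same check can also be done directly against the cluster master's splunkd.log. A minimal sketch (the log path assumes a default install; the job names are taken from the step above, matched case-insensitively):

```shell
# rebalance_jobs: filter stdin for lines mentioning the cluster-master
# jobs involved in rebalance (CMRepJob, CMChangeBucketjob, CMTruncJob).
rebalance_jobs() {
  grep -iE "cmrepjob|cmchangebucketjob|cmtruncjob"
}

# On the cluster master:
# rebalance_jobs < "$SPLUNK_HOME/var/log/splunk/splunkd.log" | tail -20
```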

3) Next, check this endpoint to see if any jobs are stuck. If you see a bucket stuck with the initial reason "rebalance cluster buckets" and the latest reason "bucket has a pending discard peer...", you will need to resync the bucket from the discarding peer.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=remove_excess
Now find the buckets pending discard and note which peer lists status=PendingDiscard.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/buckets?filter=status=PendingDiscard

Now resync the buckets against the discarding peers.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/control/control/resync_bucket_from_peer -d bucketid=customermetrics~260~0496D8DF-7666-48F1-8E98-2F5355493040 -d peer=C449637C-3731-4498-BA9F-DEB0A05B347B
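If many buckets are stuck, a small sketch like this can build one resync call per bucket/peer pair. The bucket IDs and peer GUIDs must come from the PendingDiscard query above; the host and ADMIN credentials are placeholders from this post:

```shell
# resync_cmd: print the curl invocation that resyncs one bucket from the
# named peer. It only prints the command (so you can review before running).
resync_cmd() {
  bucket=$1
  peer=$2
  printf 'curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/control/control/resync_bucket_from_peer -d bucketid=%s -d peer=%s\n' "$bucket" "$peer"
}

# Example using the values from the post above; pipe to sh to execute:
# resync_cmd customermetrics~260~0496D8DF-7666-48F1-8E98-2F5355493040 C449637C-3731-4498-BA9F-DEB0A05B347B
```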

4) If no buckets are stuck in pending discard, check "latest reason" at this endpoint for clues.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=rebalance

5) Also check for stale replications that could be blocking current replications.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/replications

6) If all else fails, try to trigger rebalance again.
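For completeness, rebalance can be checked and re-triggered from the cluster master's CLI. A sketch, assuming a default install with `splunk` on the PATH; exact flags may vary by version:

```shell
# Run on the cluster master.
# Check whether a rebalance is currently running:
splunk rebalance cluster-data -action status
# Stop a wedged run, then start a fresh one:
splunk rebalance cluster-data -action stop
splunk rebalance cluster-data -action start
```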


Re: data rebalance progress is very poor or getting stuck

Path Finder

"resync the buckets against the discarding peers." worked for me
Thanks @rbal_splunk


Re: data rebalance progress is very poor or getting stuck

Ultra Champion

@dvg06 - if it worked, you should be able to let others know by 'accepting' the answer @rpal posted. Cool?


Re: data rebalance progress is very poor or getting stuck

Path Finder

@SloshBurch - Happy to accept it as an answer, but I don't see an option to accept it. Can you help me find where that is?
I did upvote the answer, though...


Re: data rebalance progress is very poor or getting stuck

Ultra Champion

Oh, my mistake. It looks like @rbal asked the originating question and therefore is the one to 'accept'.


Re: data rebalance progress is very poor or getting stuck

Splunk Employee

Additional points to consider:

Data rebalancing attempts to achieve an optimal balance, not a perfect balance.
Rebalancing works per index, based on each index's bucket count rather than the total bucket count.
