Deployment Architecture

Data rebalance progress is very slow or stuck

rbal_splunk
Splunk Employee

On Splunk version 6.6.6 we are seeing a very slow data rebalance.
The environment is an 18-indexer multi-site cluster whose data rebalance progresses at about 0.05% per day. There are about 15-25k buckets per indexer, RF and SF are met, there are no fixup tasks, hot and cold storage are on solid state drives, and the network is a 10 Gb fiber connection. We have also cleared all excess buckets. Any thoughts?
What could be blocking the progress?

rbal_splunk
Splunk Employee

Additional Points to consider:

Data rebalancing aims for an optimal balance, not a perfect balance.
Rebalancing works per index, based on each index's bucket count rather than the total bucket count across all indexes.
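To make the per-index point concrete, here is a toy sketch (not Splunk's actual algorithm) of what "balanced per index" means: each peer should hold close to its even share of that index's buckets, checked index by index. The 0.9 threshold and the data shapes below are assumptions for illustration only.

```python
# Toy illustration of per-index rebalance logic (NOT Splunk's real algorithm).
# An index counts as "balanced" when every peer holds at least `threshold`
# times its even share of that index's buckets. The 0.9 default is an
# assumption for illustration.

def imbalanced_indexes(counts_by_index, threshold=0.9):
    """counts_by_index: {index_name: {peer_name: bucket_count}}"""
    out = []
    for index, per_peer in counts_by_index.items():
        share = sum(per_peer.values()) / len(per_peer)  # even share per peer
        if any(count < threshold * share for count in per_peer.values()):
            out.append(index)
    return sorted(out)

counts = {
    "main":    {"idx1": 100, "idx2": 98, "idx3": 102},  # roughly even
    "metrics": {"idx1": 300, "idx2": 40, "idx3": 260},  # idx2 far below mean
}
print(imbalanced_indexes(counts))  # ['metrics']
```

Note that "metrics" is flagged even though the cluster's total bucket counts might look fine: balance is evaluated index by index.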


rbal_splunk
Splunk Employee

1) Check for excess buckets and remove them. Rebalance does not run until excess buckets have been removed, to prevent unnecessary bucket movement.

https://CLUSTER_MASTER:8000/en-US/manager/system/clustering_bucket_details?tab=excess-buckets-tab
index=_internal sourcetype=metrics group=subtask_counts name=cmmaster_service to_fix_excess

2) Check whether CMRepJob, CMChangeBucketjob_build, or CMTruncJob jobs are running. If there are no excess buckets and no rebalance-related jobs running, then something is blocking the rebalance process.

index=_internal source=splunkd.log host=CM CMRepJob running job | timechart count by job

3) Next, check this endpoint to see if any jobs are stuck. If you see a bucket stuck here with the initial reason "rebalance cluster buckets" and the latest reason "bucket has a pending discard peer...", then you will need to resync the bucket against the discarding peer.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=remove_excess
Now find the buckets under pending discard and look for the peer that lists state=PendingDiscard.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/buckets?filter=status=PendingDiscard

Now resync the buckets against the discarding peers.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/control/control/resync_bucket_from_peer -d bucket_id=customer_metrics~260~0496D8DF-7666-48F1-8E98-2F5355493040 -d peer=C449637C-3731-4498-BA9F-DEB0A05B347B
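If many buckets are stuck in pending discard, the per-bucket resync call above can be scripted. A minimal sketch that builds one curl command per (bucket, peer) pair; the pairs in `pending` are the example IDs from the command above, and in practice you would fill the list from the PendingDiscard query.

```python
# Sketch: build one resync curl command per stuck (bucket_id, peer) pair.
# Populate `pending` from the PendingDiscard bucket query; the pair below
# reuses the example IDs from this thread.

CM = "https://CLUSTER_MASTER:8089"

def resync_commands(pending):
    """pending: list of (bucket_id, peer_guid) tuples."""
    return [
        f"curl -k -u ADMIN {CM}/services/cluster/master/control/control/"
        f"resync_bucket_from_peer -d bucket_id={bucket} -d peer={peer}"
        for bucket, peer in pending
    ]

pending = [
    ("customer_metrics~260~0496D8DF-7666-48F1-8E98-2F5355493040",
     "C449637C-3731-4498-BA9F-DEB0A05B347B"),
]
for cmd in resync_commands(pending):
    print(cmd)
```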

4) If no buckets are stuck in pending discard, check the "latest reason" at this endpoint for clues.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=rebalance

5) Also check for stale replications that could be blocking current replications.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/replications
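One way to spot stale replications is to compare each replication's start time against an age cutoff. The sketch below does this over a simplified entry shape; the field names (bucket, source_peer, started) and the one-hour cutoff are assumptions, not the endpoint's actual schema.

```python
import time

# Sketch: flag replications running longer than `max_age_s` as possibly stale.
# The entry fields (bucket, source_peer, started) are a hypothetical
# simplification of the /cluster/master/replications output.

def stale_replications(entries, now, max_age_s=3600):
    return [e for e in entries if now - e["started"] > max_age_s]

now = time.time()
entries = [
    {"bucket": "main~12~AAAA", "source_peer": "peer1", "started": now - 90},
    {"bucket": "main~13~BBBB", "source_peer": "peer2", "started": now - 7200},
]
for e in stale_replications(entries, now):
    print(e["bucket"], "has been replicating for over an hour")
```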

6) If all else fails, try to trigger the rebalance again. Depending on your version, this can be done from the cluster master CLI (for example, splunk rebalance cluster-data -action start) or from the Indexer Clustering page in Splunk Web.

dvg06
Path Finder

"resync the buckets against the discarding peers." worked for me
Thanks @rbal_splunk


sloshburch
Ultra Champion

@dvg06 - if it worked, you should be able to let others know by 'accepting' the answer @rbal_splunk posted. Cool?


dvg06
Path Finder

@SloshBurch - Happy to accept it as an answer, but I don't see an option to accept this as an answer. Can you help me find it?
I did upvote the answer, though.


sloshburch
Ultra Champion

Oh, my mistake. It looks like @rbal asked the originating question and therefore is the one to 'accept'.
