Deployment Architecture

Data rebalance progress is very slow or stuck

rbal_splunk
Splunk Employee

On Splunk version 6.6.6 we are seeing very slow data rebalance.
Our 18-indexer multi-site cluster shows data rebalance progressing at about 0.05% per day. There are roughly 15-25k buckets per indexer, RF and SF are met, and there are no fixup tasks. Hot and cold storage are on solid state drives, and the network is a 10 Gb fiber connection. We have also cleared all excess buckets. Any thoughts?
What could be blocking the progress?

rbal_splunk
Splunk Employee

Additional points to consider:

Data rebalancing aims for an optimal balance, not a perfect balance.
Rebalancing works per index: it balances each index's bucket count across peers, rather than the total bucket count across all indexes.
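The "optimal, not perfect" behavior is governed by a threshold on the cluster master. As a sketch (the value shown is illustrative; verify the setting and its default against the server.conf spec for your version), rebalance for an index is considered done once each peer is within the configured fraction of the ideal bucket count:

```ini
# server.conf on the cluster master (sketch; value shown is illustrative)
[clustering]
# Rebalance treats an index as balanced when each peer's bucket count
# is within this fraction of the ideal per-peer count (1.0 = perfect).
rebalance_threshold = 0.90
```

A threshold below 1.0 is why progress can legitimately stop short of a perfectly even spread.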


rbal_splunk
Splunk Employee

1) Check for excess buckets and remove them. Rebalance does not run until excess-bucket removal has finished, to prevent unnecessary rebalancing.

https://CLUSTER_MASTER:8000/en-US/manager/system/clustering_bucket_details?tab=excess-buckets-tab
index=_internal sourcetype=metrics group=subtask_counts name=cmmaster_service to_fix_excess

2) Check whether any CMRepJob, CMChangeBucketjob_build, or CMTruncJob jobs are running. If there are no excess buckets and no rebalance-related jobs running, something is blocking the rebalance process.

index=_internal source=splunkd.log host=CM CMRepJob running job | timechart count by job

3) Next, check this endpoint to see whether any jobs are stuck. If you see a bucket stuck with the initial reason "rebalance cluster buckets" and the latest reason "bucket has a pending discard peer...", you will need to resync the bucket against the discarding peer.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=remove_excess

Now find the buckets under pending discard and look for the peer that lists state=PendingDiscard.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/buckets?filter=status=PendingDiscard

Now resync the buckets against the discarding peers.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/control/control/resync_bucket_from_peer -d bucket_id=customer_metrics~260~0496D8DF-7666-48F1-8E98-2F5355493040 -d peer=C449637C-3731-4498-BA9F-DEB0A05B347B
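If many buckets are pending discard, the per-bucket resync calls can be scripted. This is a sketch, not a turnkey tool: `build_resync_cmd` is a hypothetical helper, `CLUSTER_MASTER` and `ADMIN` are placeholders, and it assumes you have already extracted each bucket's discarding-peer GUID from the endpoint output above.

```shell
#!/bin/sh
# Sketch: generate the resync command for each PendingDiscard bucket.
# CLUSTER_MASTER and ADMIN are placeholders for your environment.
CM_URI="https://CLUSTER_MASTER:8089"

# Hypothetical helper: format one resync invocation for a
# bucket_id / peer-GUID pair. Printing instead of executing lets you
# review the commands before running them.
build_resync_cmd() {
    printf 'curl -k -u ADMIN %s/services/cluster/master/control/control/resync_bucket_from_peer -d bucket_id=%s -d peer=%s\n' \
        "$CM_URI" "$1" "$2"
}

# Print the command for the example bucket/peer pair from the post.
build_resync_cmd "customer_metrics~260~0496D8DF-7666-48F1-8E98-2F5355493040" \
                 "C449637C-3731-4498-BA9F-DEB0A05B347B"

# To actually run a generated command against a live cluster master,
# pipe it to sh, e.g.:
#   build_resync_cmd "$BUCKET_ID" "$PEER_GUID" | sh
```

Reviewing the printed commands before piping them to `sh` avoids resyncing the wrong bucket/peer pair in bulk.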

4) If no buckets are stuck in pending discard, check the "latest reason" at this endpoint for clues.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/fixup?level=rebalance

5) Also check for stale replications that could be blocking current replications.

curl -k -u ADMIN https://CLUSTER_MASTER:8089/services/cluster/master/replications

6) If all else fails, try to trigger rebalance again.
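Rebalance is re-triggered from the cluster master's CLI with `splunk rebalance cluster-data`. The actions below are the standard ones; run them on the cluster master and verify the exact flags against your version's CLI help:

```shell
# On the cluster master:
splunk rebalance cluster-data -action start     # kick off a new rebalance
splunk rebalance cluster-data -action status    # check progress
splunk rebalance cluster-data -action stop      # cancel if needed
```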

dvg06
Path Finder

"Resync the buckets against the discarding peers" worked for me.
Thanks @rbal_splunk


sloshburch
Splunk Employee

@dvg06 - if it worked, you should be able to let others know by 'accepting' the answer @rbal_splunk posted. Cool?


dvg06
Path Finder

@SloshBurch - Happy to accept it as an answer, but I don't see an option to accept it. Can you help me find that?
I did upvote the answer, though.


sloshburch
Splunk Employee

Oh, my mistake. It looks like @rbal asked the originating question and therefore is the one to 'accept'.
