Why is Data Rebalance 'failing' after a couple minutes?

joefuguet
Explorer

We recently deployed 5 new indexers into site 2 of our two-site clustered environment to replace 5 old ones in the same site. We have taken the old indexers offline, and I am now attempting to rebalance the cluster.

I will note that a large amount of bucket-fixing activity is currently taking place, as the new indexers in site 2 are copying buckets from site 1 to reestablish data redundancy.

The problem is: When attempting to run a rebalance operation in the GUI from the cluster master, it will begin the rebalance successfully. A couple minutes to an hour go by while the completion % slowly climbs. This is demonstrated in splunkd.log:

06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance started, initial_work=900897
06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance completion percent=0.00
06-23-2022 10:20:02.534 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 10:20:32.893 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 09:51:49.099 -0400 INFO CMMaster - data rebalance completion percent=3.05
06-23-2022 09:52:21.558 -0400 INFO CMMaster - data rebalance completion percent=3.06
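
For anyone wanting to follow the progress lines programmatically, here is a minimal Python sketch that parses the `CMMaster` completion lines quoted above. The helper names (`parse_progress`, `out_of_order`) are hypothetical, and the regular expression assumes the exact log format shown in this post. It also flags timestamps that go backwards, which in this excerpt may mean that lines from more than one attempt were pasted together.

```python
import re
from datetime import datetime

# Matches the CMMaster progress lines in the format quoted in this post.
PROGRESS_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{4} "
    r"INFO CMMaster - data rebalance completion percent=(?P<pct>\d+\.\d+)"
)


def parse_progress(lines):
    """Return (timestamp, percent) pairs for each progress line found."""
    points = []
    for line in lines:
        match = PROGRESS_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%m-%d-%Y %H:%M:%S.%f")
            points.append((ts, float(match.group("pct"))))
    return points


def out_of_order(points):
    """Return indexes whose timestamp is earlier than the previous one."""
    return [i for i in range(1, len(points)) if points[i][0] < points[i - 1][0]]


# The log lines from the post, copied verbatim.
sample = [
    "06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance started, initial_work=900897",
    "06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance completion percent=0.00",
    "06-23-2022 10:20:02.534 -0400 INFO CMMaster - data rebalance completion percent=1.90",
    "06-23-2022 10:20:32.893 -0400 INFO CMMaster - data rebalance completion percent=1.90",
    "06-23-2022 09:51:49.099 -0400 INFO CMMaster - data rebalance completion percent=3.05",
    "06-23-2022 09:52:21.558 -0400 INFO CMMaster - data rebalance completion percent=3.06",
]

points = parse_progress(sample)
for ts, pct in points:
    print(ts.isoformat(), pct)
print("timestamps going backwards at indexes:", out_of_order(points))
```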

 

 

Then, seemingly at random, I get this message in the logs, and the rebalance suddenly stops.

06-23-2022 10:04:58.657 -0400 INFO FixupStrategy - rebalance skipped all buckets, forcing a stop

06-23-2022 10:04:59.189 -0400 INFO CMMaster - data rebalance complete! percent=100.00
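
Note that the run above is reported as `complete! percent=100.00` even though it was forced to stop, so the completion line alone does not tell the two cases apart. A minimal Python sketch of one way to distinguish them when scanning splunkd.log, assuming only the message strings quoted in this post; the function name `last_rebalance_outcome` is hypothetical.

```python
# Message fragments taken from the log lines quoted in this post.
STARTED = "data rebalance started"
FORCED_STOP = "rebalance skipped all buckets, forcing a stop"
COMPLETE = "data rebalance complete!"


def last_rebalance_outcome(lines):
    """Classify the most recent rebalance run seen in the given log lines.

    Returns "running", "forced_stop", "completed", or None if no
    rebalance messages were found.
    """
    outcome = None
    forced = False
    for line in lines:
        if STARTED in line:
            # A new run begins, so forget any earlier forced stop.
            forced = False
            outcome = "running"
        elif FORCED_STOP in line:
            forced = True
        elif COMPLETE in line:
            outcome = "forced_stop" if forced else "completed"
    return outcome


excerpt = [
    "06-23-2022 10:04:58.657 -0400 INFO FixupStrategy - rebalance skipped all buckets, forcing a stop",
    "06-23-2022 10:04:59.189 -0400 INFO CMMaster - data rebalance complete! percent=100.00",
]
print(last_rebalance_outcome(excerpt))
```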

Searching the internet did not yield any results for this message. Does anyone know what could be causing my rebalance to skip all buckets?

securitypaul
Explorer

If you are using SmartStore, be sure to avoid ticking the Searchable option when running a Data Rebalance.

"Do not use searchable data rebalance with SmartStore indexes. Searchable mode is not optimized for SmartStore and can cause data rebalance to proceed slowly. Use non-searchable data rebalance instead.

In any case, non-searchable data rebalance of SmartStore indexes usually causes only minimal search disruption. The data rebalance process runs quickly on SmartStore indexes, because it moves only bucket metadata, not the bucket data itself."

https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster

joefuguet
Explorer

Thanks, Paul. I did end up running the rebalance again a couple of weekends back with "searchable=no", and it worked fine this time. Some of our indexes are SmartStore, so maybe that was part of it. My best guess is that there was some conflict between the automatic bucket fixing triggered by adding new indexers to the cluster and the rebalance I was trying to run at the same time.

securitypaul
Explorer

I do wish that warnings like this were displayed in the GUI. It would make life easier.


isoutamo
SplunkTrust

Hi

I haven't seen that message when doing a rebalance. Can you check or do the following:

  • Check that there is enough space on all nodes
  • Check that the indexer cluster is healthy, without any errors
  • Remove all excess buckets
  • Ensure that you are doing a data rebalance, not just rebalancing primaries
  • Does the same thing happen if you rebalance only one index?
  • Have you run it with max_runtime?
  • What do you have in rebalance_search_completion_timeout?
  • What is your threshold, and is there any difference when you change it?
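
Several of these checks (a single index, max_runtime, searchable mode) correspond to options of the `splunk rebalance cluster-data` CLI command, as far as I understand it from the Splunk documentation on rebalancing the cluster. Below is a minimal Python sketch that only builds the argument list and does not run anything. The helper name `build_rebalance_command` and the index name `my_index` are hypothetical, and the flags should be verified against the documentation for your Splunk version before use.

```python
def build_rebalance_command(index=None, searchable=False, max_runtime=None,
                            splunk_bin="splunk"):
    """Build the argument list for starting a data rebalance.

    Assumes the `splunk rebalance cluster-data` syntax described in the
    Splunk docs; nothing is executed here.
    """
    cmd = [splunk_bin, "rebalance", "cluster-data", "-action", "start"]
    if searchable:
        # Searchable mode; the thread advises against it for SmartStore indexes.
        cmd += ["-searchable", "true"]
    if index:
        # Restrict the rebalance to one index to narrow down the problem.
        cmd += ["-index", index]
    if max_runtime is not None:
        # Maximum runtime, in minutes.
        cmd += ["-max_runtime", str(max_runtime)]
    return cmd


# Hypothetical example: non-searchable rebalance of one index, capped at 60 minutes.
print(" ".join(build_rebalance_command(index="my_index", max_runtime=60)))
```

Building the list without executing it keeps the sketch safe to run anywhere; the resulting command would be run on the cluster manager node.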

r. Ismo

joefuguet
Explorer

Hi Isoutamo, 

 

Thanks for the response! Let me answer the questions I can, based on my last attempt to rebalance the cluster over the weekend. I will not be able to run another attempt until this weekend, but here's what I've got:

1. Space is definitely sufficient on all nodes (upgraded to machines with double the storage capacity)

2. IDX cluster status was NOT ok at the time of the rebalance attempt. Automatic bucket fixing was occurring (due to adding new indexers) at the same time as my rebalance attempt. On reflection, I feel that this was most likely the problem: rebalancing the indexes while bucket fixing is still in progress does not make much sense.

3. This was done

4. As stated above these operations were occurring simultaneously

5. Can test this upcoming weekend

6. This was left default (blank)

7. Unsure - if this corresponds to "Searchable" GUI option, this was NOT selected.

8. Tried multiple thresholds (0.9, 0.85, 0.8) with the same result

isoutamo
SplunkTrust

7. Yes, it's exactly that.

https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster

Also, if possible, it's probably a good idea to do a rolling restart before you start rebalancing. At least some REST endpoints have returned wrong values when there have been disk issues after a restart. I'm not sure whether this still applies to versions newer than 7.3.3.

joefuguet
Explorer

Thank you, sir. Since my original message, all automatic bucket fixing has finished, so the cluster is now in a stable state. I will take your recommendation to perform a rolling restart this weekend and then attempt another rebalance, this time with searchable=yes.

isoutamo
SplunkTrust

My experience with "searchable=yes" is that you will still get some "you cannot search everything" messages (multisite, 10+ indexer nodes), so I'm not sure whether it works or not.

joefuguet
Explorer

Thanks for this. To provide a delayed update: I reran the rebalance a couple of weeks back, and it worked fine this time. I did run it with "searchable=no". Perhaps there was some conflict between the rebalance and the automatic bucket fixing that was taking place, since I had just joined multiple indexers to our cluster.