We recently deployed 5 new indexers into site 2 of our 2-site clustered environment to replace 5 old ones in the same site. We have offlined the old indexers and I am now attempting to rebalance the cluster.
I will note that a large amount of bucket-fixing activity is currently taking place, as the new indexers in site 2 copy buckets from site 1 to re-establish data redundancy.
The problem: when I run a rebalance from the cluster master GUI, it begins successfully. Anywhere from a couple of minutes to an hour goes by while the completion % slowly climbs. This is demonstrated in splunkd.log:
06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance started, initial_work=900897
06-23-2022 10:19:32.148 -0400 INFO CMMaster - data rebalance completion percent=0.00
06-23-2022 10:20:02.534 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 10:20:32.893 -0400 INFO CMMaster - data rebalance completion percent=1.90
06-23-2022 09:51:49.099 -0400 INFO CMMaster - data rebalance completion percent=3.05
06-23-2022 09:52:21.558 -0400 INFO CMMaster - data rebalance completion percent=3.06
Then, seemingly at random, I get this error message in the logs, and the rebalance suddenly stops.
06-23-2022 10:04:58.657 -0400 INFO FixupStrategy - rebalance skipped all buckets, forcing a stop
06-23-2022 10:04:59.189 -0400 INFO CMMaster - data rebalance complete! percent=100.00
Searching the internet did not yield any results for this message. Does anyone know what could be causing my rebalance to skip all buckets?
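For reference, the same progress can also be followed outside the GUI from the cluster master. A minimal sketch - the status action and the component field are my reading of the docs, so verify against your version:

# Rebalance progress via the cluster master CLI
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action status

# The same CMMaster progress lines via a search
index=_internal sourcetype=splunkd component=CMMaster "data rebalance"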
If you are using SmartStore, be sure to avoid ticking the Searchable option when running a Data Rebalance.
"Do not use searchable data rebalance with SmartStore indexes. Searchable mode is not optimized for SmartStore and can cause data rebalance to proceed slowly. Use non-searchable data rebalance instead.
In any case, non-searchable data rebalance of SmartStore indexes usually causes only minimal search disruption. The data rebalance process runs quickly on SmartStore indexes, because it moves only bucket metadata, not the bucket data itself."
https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster
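For completeness, this is roughly what a non-searchable rebalance looks like when started from the cluster master CLI instead of the GUI. A minimal sketch - non-searchable is the default, and the -max_runtime value (in minutes) is only an illustration, so check the docs page above for your version:

# Run on the cluster master (manager) node; non-searchable is the default
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action start -max_runtime 60

# Stop it early if needed
$SPLUNK_HOME/bin/splunk rebalance cluster-data -action stop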
Thanks Paul - I did end up running the rebalance again a couple of weekends back with "searchable=no". Some of our indexes are SmartStore, so maybe this was part of it. It worked fine this time. My best guess is that there was some conflict between the automatic bucket fixing from adding new indexers to the cluster and simultaneously trying to rebalance.
I do wish that warnings like this were displayed in the GUI. It would make life easier.
Hi
I haven't seen that message when doing a rebalance. Can you check/do the following:
r. Ismo
Hi Isoutamo,
Thanks for the response! Let me answer the questions I can based on my last attempt to rebalance the cluster over the weekend. I won't be able to run another attempt until this weekend, but here's what I've got:
1. Space is definitely sufficient on all nodes (we upgraded to machines with double the storage capacity)
2. IDX cluster status was NOT OK at the time of the rebalance attempt. It was rebalancing primaries (edit: I mean automatic bucket fixing was occurring due to adding new indexers) at the same time as my rebalance attempt. On reflection, I feel this was most likely the problem - rebalancing the indexes while bucket fixing was still running does not make much sense on my part (see the status-check sketch after this list).
3. This was done
4. As stated above, these operations were occurring simultaneously
5. Can test upcoming weekend
6. This was left default (blank)
7. Unsure - if this corresponds to the "Searchable" GUI option, it was NOT selected.
8. I tried multiple thresholds (0.9, 0.85, 0.8) with the same result
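For reference, the cluster status mentioned in item 2 can be checked from the cluster master before another attempt. A minimal sketch - the verbose output details vary by version:

# Overall cluster health, including replication factor and search factor status
$SPLUNK_HOME/bin/splunk show cluster-status --verbose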
7. It's just that.
https://docs.splunk.com/Documentation/Splunk/9.0.0/Indexer/Rebalancethecluster
And if possible, it's probably a good idea to do a rolling restart before starting the rebalance. At least some REST endpoints have given wrong values when there have been disk issues since the last restart. I'm not sure if this is still valid with versions newer than 7.3.3.
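For completeness, the rolling restart is issued from the cluster master and restarts the peers in batches so the cluster stays available. A minimal sketch, assuming default restart settings on the master:

# Run on the cluster master; peers restart in batches per the master's restart settings
$SPLUNK_HOME/bin/splunk rolling-restart cluster-peers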
Thank you, sir - since my original message, all automatic bucket fixing has finished, so the cluster is now in a stable state. I will take your recommendation to perform a rolling restart this weekend and then attempt another rebalance, this time with searchable=yes.
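In case it helps anyone verifying the same thing, pending fixup tasks can also be inspected over the management port before kicking off the rebalance. A rough sketch - the fixup endpoint and its level parameter are my reading of the REST API reference, and the credentials/host are placeholders, so verify against your version:

# List pending fixup tasks on the cluster master (placeholder credentials and host)
# The 'level' value (e.g. replication_factor, search_factor) is an assumption from the REST reference
curl -k -u admin:changeme "https://localhost:8089/services/cluster/master/fixup?level=replication_factor"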
Thanks for this. To provide a delayed update: I reran the rebalance a couple of weeks back and it worked fine this time... I did run with "searchable=no" this time. Perhaps there was some conflict between the rebalance and the automatic bucket fixing that was taking place, since I had just joined multiple indexers to our cluster.