We recently resized our indexer cluster from 3 nodes to 4. We ran the "rebalance" command from the master node to begin rebalancing our data across the cluster, but we've been seeing dismal performance. After running the rebalance command for over 14 hours, we've only seen ~3,000 buckets move to the new indexer. Here's where it currently stands:
As the documentation regarding this data migration states:
"Data rebalancing can cause primary bucket copies to move to new peers, so search results are not guaranteed to be complete while data rebalancing continues." (source)
We've been having to do this in the evenings during non-business hours, but at this rate it will take over a month (33 days, to be exact) for our data to normalize, and we risk having inconsistent results on ALL alerts/reports run during off hours during that period.
I see there are two parameters in server.conf related to this process, "max_peer_rep_load" and "max_peer_build_load", but from the documentation I'm unclear what the impact of increasing them would be.
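For reference, those two settings live in the [clustering] stanza of server.conf on the cluster master; the snippet below is only a sketch of where they'd go (the values are placeholders, not tested recommendations):

    # server.conf on the cluster master -- placeholder values, not recommendations
    [clustering]
    mode = master
    # max concurrent replications a single peer can participate in as a target
    max_peer_rep_load = 10
    # max concurrent "make bucket searchable" jobs that can be assigned to a single peer
    max_peer_build_load = 4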
Any thoughts as to how to make this process less painful? Thanks!
I am still going through this for a conclusive answer; the Splunk KB has limited information.
I have noticed that you can run this rebalance indexer by indexer from the CLI with -action -indexer (path). You can also initiate the same thing via the UI on the Cluster Master console: Indexer Clustering - master node >> Edit >> Data Rebalance, then select indexer by indexer. This is way faster than the cluster-wide CLI rebalance command.
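For anyone else trying this, here's a rough sketch of the CLI side, run on the cluster master (-max_runtime is in minutes; double-check the exact syntax against your version's docs or splunk help rebalance before relying on it):

    # run on the cluster master -- verify options against your version
    splunk rebalance cluster-data -action start -max_runtime 540
    splunk rebalance cluster-data -action status
    splunk rebalance cluster-data -action stop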
Please post a new question for this
@ejharts2015 - Did one of the answers below help provide a solution to your question? If yes, please click "Accept" below the best answer to resolve this post, and upvote anything that was helpful. If not, please leave a comment with more feedback. Thanks.
No, they did not. We just waited out the month, letting it run via cron from 9pm to 6am every night.
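If it helps anyone, a nightly window like that can be scripted roughly along these lines (the paths, credentials, and the -max_runtime/-auth options are placeholders and assumptions; adjust for your environment):

    # crontab on the cluster master -- placeholder paths and credentials
    # start rebalancing at 21:00 and cap it at 9 hours (540 minutes)
    0 21 * * * /opt/splunk/bin/splunk rebalance cluster-data -action start -max_runtime 540 -auth admin:changeme
    # belt and suspenders: explicitly stop it at 06:00
    0 6 * * * /opt/splunk/bin/splunk rebalance cluster-data -action stop -auth admin:changeme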
Splunk's data rebalancing functionality is fairly new, and Splunk's metrics have yet to be enhanced to improve reporting on it.
The Cluster Master's main job is to orchestrate bucket replication and other activities between indexers while staying compliant with the replication factor (RF) and search factor (SF). The Cluster Master performs each activity as a job, and jobs are processed in an order where critical activities like meeting 'Replication_factor' and 'Search_factor' have the highest priority and 'To_fix_rebalance' has the lowest. This means it rebalances buckets only as and when it has spare cycles, because meeting RF and SF so that all data remains searchable is always the top priority. The priority order (highest to lowest) is roughly:
-streaming
-Data_safety
-Replication_factor
-Search_factor
-Generate
-sync
-To_fix_excess
-To_fix_summary
-To_fix_rebalance
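If you want to see how many fixup tasks are pending in each of these categories, something like the query below may work; the cluster/master/fixup REST endpoint and its level parameter are assumptions on my part, so verify them against the REST API reference for your version:

    # query pending fixup tasks on the cluster master, filtered to one category
    curl -k -u admin:changeme \
      "https://cluster-master:8089/services/cluster/master/fixup?level=replication_factor&output_mode=json"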
For better performance, the recommendation would be to spec the Cluster Master and cluster peers per Splunk's recommended hardware guidelines.
What is the retention on the affected indexes? In other words, why rebalance buckets for indexes with a retention of 30 days or less? As the old data ages out, it will solve the problem without intervention. This certainly seems less painful than running the data balancing, even if you can get the buckets rebalanced in a couple of weeks instead of 33 days.
Also, does it really matter if older buckets are "unbalanced"? In many environments, the vast majority of searches only run over the last 24 hours. If you have been running the 4-indexer cluster for more than 24 hours (as obviously you have), the most recent data is already balanced.
What benefit do you want to get from data rebalancing?
How much data are you moving? Copying data from one machine to another is a pretty high-overhead process.
Finally, I know this is not really an answer to your question. Sorry about that. But I don't know if there is a way to make the actual process faster...
Our retention is at least a year on all our indexes, some are longer for compliance.
Imbalance in the data DOES matter, because the load against the boxes is heavily weighted toward the old indexers and the new indexer is hardly being used at all. Plus there's the added effect of imbalanced disk space.
For example:
Indexers 1, 2, and 3 (the old indexers) all currently have 6+ TB each, versus the new indexer, which has a whopping 700 GB on it. Since Splunk doesn't forward data based on disk space, our old indexers are at risk of filling up their hard drives before the new one even gets close.
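To quantify the skew, a quick (if somewhat heavy) check from a search head is something like this; dbinspect across all indexes can be expensive, so scope it down if needed:

    | dbinspect index=*
    | stats count AS buckets, sum(sizeOnDiskMB) AS total_mb BY splunk_server
    | eval total_gb = round(total_mb/1024, 1)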
Most of these issues are discussed in the documentation I linked above: http://docs.splunk.com/Documentation/Splunk/6.5.1/Indexer/Rebalancethecluster
A balanced set of bucket copies optimizes each peer's search load and, in the case of data rebalancing, each peer's disk storage.