Rolling restart of multisite cluster not working as expected

sini · ‎01-17-2023

Hi all,

We have an issue with our environment (all currently running 8.2.8 on Windows Server 2016 Standard). We have a multisite indexer cluster consisting of 2 sites with 2 indexers in each site, a separate 2-node search head cluster, and 1 indexer Master node. Replication is working as expected, and when nodes are taken offline manually, search and data durability behave as desired (for example, if you take a whole site or a single node in a site offline, everything is still searchable).

When a configuration bundle that requires a restart is deployed via the Master node, all indexers in both sites restart at the same time, interrupting all searches.

The following values are present in server.conf (using btool) on the Master node:

[clustering]
percent_peers_to_restart = 10
restart_timeout = 60
rolling_restart = restart
rolling_restart_condition = batch_adding
replication_factor = 2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2

On top of that, and also unpleasant: in many cases (for example, when applying changes to props.conf for existing stanzas via the Master node), the indexers restart even though bundle validation on the Master reported that a restart is not required.

According to forum posts, the issue should have been fixed in the 6.5.2 release. This environment, however, was initially installed with 7.x, so it cannot be an issue carried along through upgrades.

Any thoughts appreciated.

Many thanks and regards

PaulPanther · ‎01-17-2023

Do you see any timeouts in your logs during the rolling restart? You should check how long a peer needs for its restart and, if necessary, increase the restart_timeout parameter to a suitable value.

Furthermore you could try to change the parameter rolling_restart from restart to either searchable or searchable_force.
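
As a rough sketch, assuming the other [clustering] values from the original post stay as they are, the change in server.conf on the Master node might look like this (only the rolling_restart line differs; whether it resolves the simultaneous restarts is not confirmed in this thread):

```ini
[clustering]
# Assumption: only this value changes. The original post shows "restart";
# "searchable" and "searchable_force" are the two alternatives suggested above.
rolling_restart = searchable
```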

sini · ‎01-17-2023

Hi,

@PaulPanther wrote:
Do you see any timeouts in your logs during the rolling restart? You should check how long a peer needs for its restart and, if necessary, increase the restart_timeout parameter to a suitable value.

They all restart at exactly the same time. It's not like the Master waits 60 seconds before continuing, but I increased the value to 300 seconds.

@PaulPanther wrote:
Furthermore you could try to change the parameter rolling_restart from restart to either searchable or searchable_force.

Thanks for the hint, I'll try "searchable" and post an update.

Regards
