Splunk Enterprise

Rolling restart of multisite cluster not working as expected

sini
Explorer

Hi all,

We have an issue with our environment (everything is currently running 8.2.8 on Windows Server 2016 Standard). We have a multisite indexer cluster consisting of 2 sites with 2 indexers in each site, a separate 2-node search head cluster, and 1 indexer Master node. Replication is working as expected, and when nodes are taken offline manually, search and data durability behave as desired (for example, if you take a whole site or a single node in a site offline, everything is still searchable).

When a configuration bundle that requires a restart is deployed via the Master node, all indexers in both sites restart at the same time, interrupting all searches.

The following values are present in server.conf (using btool) on the Master node:

[clustering]
percent_peers_to_restart = 10
restart_timeout = 60
rolling_restart = restart
rolling_restart_condition = batch_adding
replication_factor = 2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
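
As a rough sanity check on these values, the sketch below estimates how many peers one restart round would cover for a given cluster size. It assumes the batch size is the percentage rounded down with a floor of one peer per round; the exact rounding Splunk applies is an assumption here, and `estimated_batch_size` is a hypothetical helper, not a Splunk API.

```python
import configparser
import math

# The [clustering] stanza as reported by btool on the Master node.
STANZA = """
[clustering]
percent_peers_to_restart = 10
restart_timeout = 60
rolling_restart = restart
rolling_restart_condition = batch_adding
replication_factor = 2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
"""


def estimated_batch_size(conf_text: str, peer_count: int) -> int:
    """Estimate peers restarted per round (assumed: floor, minimum of 1)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    percent = parser.getint("clustering", "percent_peers_to_restart")
    # Assumption: round down, but never restart fewer than one peer per round.
    return max(1, math.floor(peer_count * percent / 100))


if __name__ == "__main__":
    # Four indexers in total (2 sites x 2 indexers).
    print(estimated_batch_size(STANZA, peer_count=4))
```

Under that assumption, four peers at 10 percent would mean one peer per round, so all four restarting at once does not match what the setting suggests.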

On top of that, and also unpleasant: in many cases, for example when applying changes to props.conf for existing stanzas via the Master node, the indexers restart even though bundle validation on the Master reported that a restart is not required.

According to forum posts, the issue should have been fixed in the 6.5.2 release. This environment, however, was originally installed with 7.x, so it cannot be an issue that was carried along through upgrades.

Any thoughts appreciated.

Many thanks and regards

 

 


PaulPanther
Builder

Do you see any timeouts in your logs during the rolling restart? You should check how long a peer needs for its restart and, if necessary, increase the restart_timeout parameter to a suitable value.

Furthermore, you could try changing the rolling_restart parameter from restart to either searchable or searchable_force.
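
To make the suggested change concrete, here is a minimal sketch that rewrites a [clustering] stanza so that rolling_restart is set to searchable. It only manipulates text and does not talk to a Splunk instance; `set_clustering_option` is a hypothetical helper, and the shortened stanza is taken from the values posted above.

```python
import configparser
import io

# Shortened copy of the stanza from the original post.
ORIGINAL = """
[clustering]
percent_peers_to_restart = 10
restart_timeout = 60
rolling_restart = restart
"""


def set_clustering_option(conf_text: str, key: str, value: str) -> str:
    """Return the stanza text with one [clustering] setting replaced."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    parser.set("clustering", key, value)
    buffer = io.StringIO()
    parser.write(buffer)  # writes "key = value" lines under [clustering]
    return buffer.getvalue()


if __name__ == "__main__":
    print(set_clustering_option(ORIGINAL, "rolling_restart", "searchable"))
```

The same helper could be called with restart_timeout and a larger value if the logs show peers needing more time to come back.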


sini
Explorer

Hi,


@PaulPanther wrote:

Do you see any timeouts in your logs during the rolling restart? You should check how long a peer needs for its restart and, if necessary, increase the restart_timeout parameter to a suitable value.

They all restart at exactly the same time. It's not as if the Master waits 60 seconds before continuing, but I have increased the value to 300 seconds.


@PaulPanther wrote:

Furthermore, you could try changing the rolling_restart parameter from restart to either searchable or searchable_force.


Thanks for the hint, I'll try "searchable" and post an update.

Regards
