Splunk Enterprise

Rolling restart of multisite cluster not working as expected

sini
Explorer

Hi all,

We have an issue with our environment (all running 8.2.8 currently on Windows Server 2016 Standard). We have a multisite Indexer cluster consisting of 2 Sites with 2 Indexers in each site, a separate 2 node Searchhead cluster, and 1 Indexer Master node. Replication is working as expected, and when nodes are manually taken offline, search and data durability are affected as desired (for example, if you take a whole site or a single node in a site offline, everything is still searchable).

When a configuration bundle that requires a restart is deployed via the Master node, all indexers in both sites restart at the same time, interrupting all searches.

The following values are present in server.conf (using btool) on the Master node:

[clustering]
percent_peers_to_restart = 10
restart_timeout = 60
rolling_restart = restart
rolling_restart_condition = batch_adding
replication_factor = 2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2

On top of that, and also unpleasant: in many cases (for example, when applying changes to props.conf for existing stanzas via the Master node), the indexers restart even though bundle validation on the Master reported that a restart is not required.

According to forum posts, the issue should have been fixed in the 6.5.2 release. This environment, however, was originally installed with 7.x, so it cannot be an issue that was carried along through upgrades.

Any thoughts appreciated.

Many thanks and regards

PaulPanther
Motivator

Do you see any timeouts in your logs during rolling restart? You should check how long a peer needs for its restart and may increase the parameter restart_timeout to a proper value.

Furthermore you could try to change the parameter rolling_restart from restart to either searchable or searchable_force.
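For illustration only, the two suggestions above could be sketched as the following change to the [clustering] stanza on the Master node. This is an untested sketch rather than a confirmed fix: the restart_timeout value is a placeholder, and all other settings from the original post are assumed to stay as they are.

```
# server.conf on the Master node (sketch under the assumptions stated above)
[clustering]
# Changed from "restart" as suggested; "searchable_force" is the other option mentioned
rolling_restart = searchable
# Placeholder value; set it to how long a peer actually needs to restart
restart_timeout = 300
```

Running btool on the Master again afterwards, as in the original post, should show whether the new values are the ones actually in effect.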


sini
Explorer

Hi,


@PaulPanther wrote:

Do you see any timeouts in your logs during rolling restart? You should check how long a peer needs for its restart and may increase the parameter restart_timeout to a proper value.

They all restart at exactly the same time. It's not like the Master waits 60 seconds before continuing, but I increased the value to 300 seconds.


@PaulPanther wrote:

Furthermore you could try to change the parameter rolling_restart from restart to either searchable or searchable_force.


Thanks for the hint, I'll try "searchable" and post an update.

Regards
