How to handle Multisite Cluster Master failure?

MMCC · ‎03-04-2020

Hi all,

After checking the documentation, I am still not much wiser.
I hope some of you have encountered the same issue and found a solution.

Following scenario:

2 sites, both in Europe, defined as a multisite Indexer Cluster (IXC)
Site 1 has the Cluster Master (CM) configured for the multisite IXC
Site 2 is maintained externally, so a VM copy, snapshot, etc. is not possible

Site 1 (with the CM) goes down or becomes unreachable from the other site.
The documentation specifies that a new CM should be set up immediately.

If the site holding the master node
fails, you lose the master's
functionality. You must immediately
start a new master on one of the
remaining sites.

Configure a stand-by master on at
least one of the sites not hosting the
current master.
When the master site
goes down, bring up a stand-by master
on one of the remaining sites

My questions:

Why can't the master on site 2 be up and running? It has to be "brought" up...
Re-configuring all peers to the new master takes time! How can this be avoided?
If the "failover" CM is not known to the peers can it still be up?

For reference, see the following pages:

https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Mastersitefailure
https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Handlemasternodefailure
https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Whathappenswhenamasternodegoesdown

Thank you in advance for any hints or advice.

Kind regards

Marco

MMCC · ‎03-05-2020

In the following documentation, this is the part I can't find a solution to:

It is still managed by a single
cluster master node, which has to be
failed over to the DR site in case of
a disaster.

Failure of Management functions need
to be handled outside of Splunk in
case of site failure

https://www.splunk.com/pdfs/technical-briefs/splunk-validated-architectures.pdf

nickhills · ‎03-04-2020

Hi @MMCC

Why can't the master on site 2 be up and running? It has to be "brought" up...

A multisite cluster can have only one Cluster Master running at a time. The master coordinates all the actions of the cluster, so you cannot have two.

Re-configuring all peers to the new master takes time! How can this be avoided?

You should have a standby master: a Splunk instance with the same splunk.secret, cluster shared key, config, apps, etc., but kept offline (VM shut down).

Ideally, your cluster peers should use a DNS name (not an IP address) to reference the master.

To bring the standby master online, change the DNS records to reflect the IP of the standby master, and start the VM.
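
As a rough illustration of the DNS advice, here is a small Python sketch that checks whether a peer's server.conf references the master by IP address instead of by name. It assumes the 8.0-era setting names (mode = slave, master_uri) under the [clustering] stanza; the host name splunk-cm.example.com is made up, and Python's configparser only approximates Splunk's .conf syntax.

```python
import configparser
import ipaddress
from urllib.parse import urlsplit


def master_uri_uses_ip(server_conf_text: str) -> bool:
    """Return True if [clustering] master_uri points at a literal IP address."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(server_conf_text)
    uri = parser.get("clustering", "master_uri")
    host = urlsplit(uri).hostname
    try:
        # ip_address() raises ValueError for anything that is not an IP literal
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# Hypothetical peer configuration that references the CM by DNS name
PEER_CONF = """
[clustering]
mode = slave
master_uri = https://splunk-cm.example.com:8089
"""

print(master_uri_uses_ip(PEER_CONF))  # False: the peer follows the DNS record
```

A peer that reports True here would need its server.conf edited during a failover, which is exactly the delay the DNS approach avoids.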

If the "failover" CM is not known to the peers can it still be up?

If the standby CM has the same splunk.secret, and the same cluster shared key, and the same master apps, then the peers will accept it as the previously running CM.

A word of Caution
MAKE SURE YOU KEEP THE MASTER APPS IN SYNC.

If you make changes to the master apps on the primary CM (in particular indexes.conf), make sure you copy the changes to the standby.
If you fail to do this and you have an index defined on your primary which is NOT defined on the standby, then when the standby master comes online it will remove the missing index from your peers. This is not fun. Don't let it happen to you.
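
One way to catch this before a failover is to compare the index stanzas of the two indexes.conf files. The following Python sketch is only an approximation: it parses the files with configparser, which does not cover every Splunk .conf feature, and it treats every stanza other than [default] and [volume:...] as an index. The index names and paths are invented for the example.

```python
import configparser


def index_names(indexes_conf_text: str) -> set:
    """Collect stanza names that look like index definitions."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(indexes_conf_text)
    # Assumption: [default] and [volume:...] stanzas are not indexes
    return {
        name for name in parser.sections()
        if name != "default" and not name.startswith("volume:")
    }


def missing_on_standby(primary_text: str, standby_text: str) -> list:
    """Indexes defined on the primary CM but absent on the standby CM."""
    return sorted(index_names(primary_text) - index_names(standby_text))


# Hypothetical file contents for the two cluster masters
PRIMARY = """
[web]
homePath = $SPLUNK_DB/web/db

[firewall]
homePath = $SPLUNK_DB/firewall/db
"""

STANDBY = """
[web]
homePath = $SPLUNK_DB/web/db
"""

print(missing_on_standby(PRIMARY, STANDBY))  # ['firewall']
```

Any name in the output is an index the standby does not know about and should be copied over before the standby is ever started as the active master.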

If my comment helps, please give it a thumbs up!

MMCC · ‎03-05-2020

Hi @nickhillscpl

Thank you for getting in touch so quickly.

You gave me one solution to the problem of re-configuring the peers for the new CM.
I'm not sure if that will be possible with the failover site not in our administration. I'll verify that.

The only thing I can't wrap my head around is how to sync the data when the VM of the standby is down. Shouldn't that VM at least be running so the necessary files can be copied to the instance?

The only option I can think of now is a drive (with Splunk installed) that is mounted to the VM at boot time, with the files synced there, so that when the VM comes up the failover CM is aligned.
Is that about right?

Thanks in advance for any additional feedback.

nickhills · ‎03-05-2020

That’s one approach.
The other is to be 100% sure your standby is not in DNS, start it, sync the files, and shut it down.

The reason I suggest keeping it offline is to reduce any chance of it being confused with the primary, but another approach is to keep the host running with Splunk shut down.
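
A pre-flight check along these lines could confirm that the standby is not in DNS before it is started for a sync. This is a minimal Python sketch, not a tested procedure: the host name and IP addresses are hypothetical, and the example call uses a fake resolver so that it runs without network access.

```python
import socket


def resolve_ips(hostname: str) -> set:
    """Resolve a host name to the set of IP addresses DNS returns for it."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return {info[4][0] for info in infos}


def standby_is_in_dns(cm_dns_name: str, standby_ip: str, resolver=resolve_ips) -> bool:
    """Return True if the CM's DNS name currently resolves to the standby's IP."""
    return standby_ip in resolver(cm_dns_name)


# Fake resolver for illustration: the CM name points at the primary only
def fake_resolver(name: str) -> set:
    return {"10.1.0.10"}


print(standby_is_in_dns("splunk-cm.example.com", "10.2.0.10", resolver=fake_resolver))  # False
```

If the check returns True, the peers could already be talking to the standby, so it should not be started alongside the primary.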

If my comment helps, please give it a thumbs up!

MMCC · ‎03-05-2020

Thanks again for replying.

In our scenario, the NOC / Service Desk (SD) does not have direct access to the machine with the "failover" CM.

The VM is part of a cloud service that we can't control... With that in mind Splunk would have to be up and running, as no one can access the machine and start Splunk on the VM.

The action that the NOC / SD would have to perform should be very simple, as they will most likely have other problems to focus on. It's unfortunate that when the site goes down the monitoring fails too 🙂
