
Why does the master node continue to be a single point of failure in clustering?

Contributor

Why does the master node continue to be a single point of failure in clustering?
Or is there any plan to make the master node highly available (HA)?


Splunk Employee

It is not a single point of failure. If the cluster master is lost, there is no effect on operation of the cluster, either on searching or indexing. The effect of a failed cluster master is that, if an (additional) indexer node fails before the cluster master is recovered, then searches will be affected and data will be indicated as unsearchable.

Note that a cluster master is stateless, so recovery does not need any kind of data recovery, but simply starting up a replacement at the same address as the missing one. This can be done with a simple boot of a plain master machine image.
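The behaviour described above can be sketched as a toy model. This is only an illustration of the reasoning, not Splunk code: the class `ToyCluster`, its method names, and the peer names are hypothetical, and the model deliberately ignores replication factors and the brief fix-up period that follows a peer failure even when the master is up.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical toy model of the failure semantics described above."""
    peers_up: set
    master_up: bool = True
    fully_searchable: bool = True

    def fail_master(self) -> None:
        # Losing only the master does not change searching or indexing.
        self.master_up = False

    def fail_peer(self, name: str) -> None:
        self.peers_up.discard(name)
        # Without a master, nothing coordinates recovery of the lost
        # peer's data, so some data shows up as unsearchable.
        if not self.master_up:
            self.fully_searchable = False

    def recover_master(self) -> None:
        # Assumption: a replacement started at the same address can
        # coordinate fix-up because enough copies survive on other peers.
        self.master_up = True
        self.fully_searchable = True


cluster = ToyCluster(peers_up={"idx1", "idx2", "idx3"})
cluster.fail_master()
print(cluster.fully_searchable)   # master down only: still searchable
cluster.fail_peer("idx2")
print(cluster.fully_searchable)   # second failure before recovery: affected
```

The point of the sketch is that the master failure alone changes nothing observable; it is the second failure, arriving before the master is replaced, that affects searches.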


Explorer

Hi all,
This is an old question, but it is still troubling me. I have to disagree with gkanapathy's answer.
I do think that the Cluster Master is a single point of failure.

https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Handlemasternodefailure

Although there is currently no master failover capability, you can prepare the indexer cluster for master failure by configuring a stand-by master that you can immediately bring up if the primary master goes down. You can use the same method to replace the master intentionally.

https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Mastersitefailure

If the site holding the master node fails, you lose the master's functionality. You must immediately start a new master on one of the remaining sites.

This means that the functionality contained within the CM is really important.
Did anyone find a way to mitigate a master failure?
Bringing up the standby CM takes time. Why can't it run in parallel?


Ultra Champion

Your comment:

Bringing up the standby CM takes time. Why can't it run in parallel?

What would you do if you had a standby CM running in parallel, and the primary failed?
Surely you would promptly replace it, and that process would take the same amount of time as simply replacing a single primary.

The fact that the CM is stateless means recovery is a straightforward process and does not depend on a lengthy task to restore it.

I fully advocate keeping a recent backup of your instances for this eventuality, but a CM failure is not a breaking failure (in the same way that losing a single properly spec'd idx peer or SHC member is not immediately breaking).

Certainly you should have a plan to recover a Splunk instance, but aside from the specific commands to add your replacement instance to its relevant idx or SHC cluster, it's a very similar process.
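As part of such a recovery plan, it helps to notice quickly that the master is gone. Below is a minimal sketch of a reachability check, assuming the master listens on Splunk's default management port 8089. The function names and the hostname in the usage note are hypothetical, and the check only tests whether a TCP connection succeeds; it does not call any Splunk API or verify that the master is healthy.

```python
import socket


def master_reachable(host: str, port: int = 8089, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def should_start_standby(host: str, port: int = 8089,
                         attempts: int = 3, timeout: float = 3.0) -> bool:
    """Return True only if every attempt fails, so that a single
    dropped connection does not trigger a replacement."""
    return not any(master_reachable(host, port, timeout)
                   for _ in range(attempts))
```

For example, a scheduled job could call `should_start_standby("cm.example.internal")` (a placeholder hostname) and alert whoever is responsible for booting the replacement master at the same address.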


Explorer

@nickhillscpl Wow, thanks for also replying to my questions in this post.

Perhaps I'm viewing this as more grave than it actually is. Still, it is unsatisfying that there is no standard built-in failover capability for this important role.


Ultra Champion

I think if you read that last paragraph as

If the site holding the master node fails, you lose the master's functionality. You SHOULD immediately start a new master on one of the remaining sites.

The original answer stands. There is no immediate impact on the remaining cluster peers, and the indexers will continue to operate as normal. However, it makes sense to replace the failed component as soon as possible, so that a further failure does not cause disruption.

In a RAID 5 disk set, you can afford to lose a single device and your data is preserved. You must/should replace the failed disk as soon as possible to prevent a further failure from causing data loss.

Same thing applies.



Contributor

I have a question about this: according to the docs, the search head queries the master to get the list of peers and then sends requests directly to them. If the master fails, how do searches continue?