Deployment Architecture

Splunk v5 Clustering and HA

Drainy
Champion

I have a scenario and was wondering if somebody could confirm what would happen...

Let's pretend we're the Federation. Obviously we have a lot of data across the galaxy, but we're really quite interested in our local star system.
To that end we've decided to install Splunk v5 (they obtained it after a time-travel-related accident). So on the Enterprise we install an indexer, Excelsior also has an indexer, and Spacedock (in orbit of Earth) has two indexers (it's generating quite a lot of data).
Finally, on the moon we have another two indexers.
On Earth we have a couple of search heads dotted around. Because we've installed v5 we decide to set up a cluster to make sure we have HA across the fleet; the master is located on a dedicated terminal on Spacedock.

All works well for a week when suddenly a giant cylindrical thing with a giant floating ball and crazy whale noises appears. No one has any idea what's going on, and then Spacedock loses all power.
Our master and two indexers are taken offline.

According to the docs the search heads will continue to try and search across their previously known indexers, so in this case HA has actually failed and we don't have any redundancy against entire-site failures if the master is located on that site. Is that correct? Is there any way to mitigate or protect against this? (Short of sticking the master on a satellite)

Thanks for any opinions or views, the more the merrier.

1 Solution

Vishal_Patel
Splunk Employee

I'll try to summarize your question to make sure I have it right: if my master and 2 peers fail in a cluster with replication factor = 3, what will happen?

In this case, the cluster won't be able to take corrective action to recover until the master (original or a new one) is brought back into the cluster. Although there are plans to make the master redundant in a future release, in 5.0 there is no notion of multiple masters. However, one nice property of the 5.0 master is that it persists no data: if your master completely blows up, you just have to stand up a separate machine configured with the exact same clustering stanza in server.conf, and as long as the master_uri from the peers'/search heads' point of view doesn't change, the new master will be able to reconstruct state once all peers have registered themselves against it. This fact can be used to set up a fail-over master node with DNS/virtual IP tricks; this of course is not first-class support for master redundancy, but it may be a suitable workaround for some folks.
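
As a rough sketch of that workaround (not an official pattern), the peers and search heads could point master_uri at a DNS alias rather than at a fixed host. The hostname, port, and pass4SymmKey below are purely illustrative:

    # server.conf on each peer indexer
    [replication_port://9887]

    [clustering]
    mode = slave
    master_uri = https://splunk-master.starfleet.example:8089
    pass4SymmKey = changeme

    # server.conf on each search head
    [clustering]
    mode = searchhead
    master_uri = https://splunk-master.starfleet.example:8089
    pass4SymmKey = changeme

Because master_uri points at the alias (or a virtual IP) rather than at a specific machine, repointing it at a standby box lets the peers and search heads re-register against the new master without any configuration change on their side.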

piebob
Splunk Employee

Drainy
Champion

Good recommendation. At the moment they are using teleporters to action scripted inputs from remote probes.

jonuwz
Influencer

You've reached the same conclusion as me.

Splunk has a tendency to make key components non-HA-able (master node, license master), and that doesn't lend itself to enterprises (pun intended) that need to operate multiple redundant datacentres.

The kicker with your problem is that even if there are searchable copies of the Spacedock buckets on an indexer outside of Spacedock (which isn't guaranteed unless the search factor is 3), the search heads won't be able to access that data while the master is down.

Why ?

1) If a peer goes down, the master assigns primacy to copies of the missing buckets on other peers.

2) The master rolls the generation id. (The generation tells the peers which buckets are marked as primary)

3) The heads get the new generation ID

Now, given that a peer only searches buckets marked as primary, if steps 1 and 2 don't happen, the copies of the missing buckets will never be searched.

In other words, you're stuffed.

Splunk's HA implementation needs to support multiple masters without resorting to expensive cross-site SAN replication and DNS trickery. It also really needs to be able to set affinities to prevent replication occurring within the same virtual host / server frame / datacentre / region.

These are features in other enterprise-grade distributed software, but not necessarily in v1 of the HA-enabled release.

Drainy
Champion

Agreed on all points, except for the fact that it's not enterprise grade. It really depends on the enterprise: a great many of the ones I dealt with previously don't even run multiple sites, and their biggest concern is planning for a server outage. In that scenario this would be suitable, and I believe there is even a customer out there with around 100-odd indexers using it. For our problems, it's not going to work as it is.

Drainy
Champion

Double bump for any updates?

Drainy
Champion

Bump, just in case you missed my first comment 🙂

Drainy
Champion

Interesting. So there is no state information kept on the master at all? Does the master query the peers to gather all the details on load?
Whilst a failover could work, in an environment where you need to guarantee continuous operation in the event of a disaster this wouldn't be a suitable HA setup (although I do think it's pretty good).
Any idea on when a redundant master could appear? I appreciate you don't make promises, but is it something targeted for a maintenance release or a full release?

Steve_G_
Splunk Employee

You need the master to fix up problems arising from peer node failure. So, if the master and a peer (or two) go down simultaneously, you're not going to be able to recover (and hence search across the entire set of data) until the master restarts and/or you get another master in there to replace it. Suggested pre-remedial action: time travel back an hour or two and move the master to a separate location from the peers.

As a general rule, your master should not run on the same machine as any peer.

Steve_G_
Splunk Employee

Bit of a problem there, but as Vishal suggests, you can start up a new master elsewhere, which will then lead to full recovery.
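
For completeness, a minimal sketch of the stand-in master, assuming it reuses the original master's clustering stanza and that the hypothetical DNS alias or virtual IP from earlier now resolves to it (values illustrative only):

    # server.conf on the replacement master
    [clustering]
    mode = master
    replication_factor = 3
    search_factor = 2
    pass4SymmKey = changeme

Since the master holds no persistent bucket state, it rebuilds its picture of the cluster from the peers as they register against it.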

alacercogitatus
SplunkTrust

But what happens if the peer and the master are not on the same machine, but the entire site goes down? As in, hurricane flooding takes out the entire datacenter?

yannK
Splunk Employee

If your indexers from the replication cluster are on very distant planets, make sure to have a connection that can "make point five past lightspeed."

alacercogitatus
SplunkTrust

From my very limited knowledge on the subject, as long as you have a search factor of (failed_indexers + 1), you can recover and search with no problem almost instantly. If your search factor is not high enough, you will have to make the non-searchable copies searchable (which takes time), assuming you have a replication factor of at least (failed_indexers + 1).

http://docs.splunk.com/Documentation/Splunk/5.0/Indexer/Thesearchfactor
http://docs.splunk.com/Documentation/Splunk/5.0/Indexer/Thereplicationfactor
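
As an illustrative sizing for the scenario above (two Spacedock peers lost at once), that reasoning would suggest a master stanza along these lines; this is a sketch, not official guidance:

    # server.conf on the master
    [clustering]
    mode = master
    replication_factor = 3
    search_factor = 3

With six peers and a search factor of 3, losing any two peers still leaves at least one searchable copy of every bucket somewhere in the fleet, provided the master (or a stand-in) is available to re-assign primaries.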

The cluster should allow a short window with Spacedock down, but it won't tolerate that for too long (time unspecified).
http://docs.splunk.com/Documentation/Splunk/5.0/Indexer/Whathappenswhenamasternodegoesdown

alacercogitatus
SplunkTrust

Hrm. Apparently I fail at reading. From the docs: "If a master goes down, the cluster can continue to run as usual, as long as there are no other failures." So I'm thinking, as a "let's get this whale outta here" kind of move, you could copy the replicated hot and warm buckets from the colddb location into the hot/warm path on that indexer to bring the searches back up? Not 100% sure, though; I have yet to fully play with replication and HA.

Drainy
Champion

But how could it recover? The master coordinates searches by sending details of the available indexers to all the search heads, so if the master is offline, how do the search heads know where to search? (I've spent a week thinking on this 🙂 ) Essentially the issue in this case isn't who has the data, it's how the search heads know to search it.

MHibbin
Influencer

I think you secretly desire a career as a sci-fi novel writer.
