I have a scenario and was wondering if somebody could confirm what would happen...
Let's pretend we're the Federation. Obviously we have a lot of data across the galaxy, but we're really quite interested in our local star system.
To that end we've decided to install Splunk v5 (they obtained it after a time-travel-related accident). So on the Enterprise we install an indexer, Excelsior also has an indexer, and Spacedock (in orbit of Earth) has two indexers (it's generating quite a lot of data).
Finally, on the moon we have another two indexers.
On Earth we have a couple of search heads dotted around. Because we've installed v5, we decide to set up a cluster to make sure we have HA across the fleet; the master is located on a dedicated terminal on Spacedock.
All works well for a week when suddenly a giant cylindrical thing with a giant floating ball and crazy whale noises appears. No one has any idea what's going on when suddenly Spacedock loses all power.
Our master and two indexers are taken offline.
According to the docs, the search heads will continue to try to search across their previously known indexers. So in this case HA has actually failed, and we have no redundancy against an entire-site failure if the master is located on that site. Is that correct? Is there any way to mitigate or protect against this? (Short of sticking the master on a satellite.)
Thanks for any opinions or views, the more the merrier.
From my very limited knowledge on the subject, as long as you have a search factor of (failed indexers + 1), you can recover and keep searching almost instantly. If your search factor is not high enough but your replication factor is at least (failed indexers + 1), the cluster will have to make the non-searchable copies searchable first, which takes time.
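Those two factors are set in the master's server.conf; a minimal sketch of the [clustering] stanza (the values and the security key here are illustrative, not taken from the question):

```ini
# server.conf on the master node (Splunk 5.x)
[clustering]
mode = master
# keep 3 full copies of each bucket spread across the peers
replication_factor = 3
# keep 2 of those copies searchable (index files included)
search_factor = 2
# shared secret for the whole cluster (illustrative value)
pass4SymmKey = fleet-secret
```

With search_factor = 2, losing one searchable peer still leaves a searchable copy elsewhere; losing two means the master has to rebuild index files from a non-searchable copy before those buckets are searchable again.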
The cluster should tolerate a short window with Spacedock down, but won't tolerate it for too long (time unspecified).
But how could it recover? The master coordinates searches by sending the list of available indexers to all the search heads; if the master is offline, how do the search heads know where to search? (I've spent a week thinking on this 🙂) Essentially the issue isn't who has the data in this case, it's how the search heads know to search it.
Hrm. Apparently I fail at reading. From the docs:
If a master goes down, the cluster can continue to run as usual, as long as there are no other failures. So I'm thinking, as a "let's get this whale outa here" kinda move, you could copy the replicated hot and warm buckets from the colddb location into the hot/warm path on that indexer to bring the searches back up? Not 100% on that, I have yet to fully play with replication and HA.
If your indexers from the replication cluster are on very distant planets, make sure to have a connection that can "make point five past lightspeed."
You need the master to fix up problems arising from peer node failure. So, if the master and a peer (or two) go down simultaneously, you're not going to be able to recover (and hence search across the entire set of data) until the master restarts and/or you get another master in there to replace it. Suggested pre-remedial action: time travel back an hour or two and move the master to a separate location from the peers.
As a general rule, your master should not run on the same machine as any peer.
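In server.conf terms, that means the machine running `mode = master` carries no peer stanza of its own; each peer just points at the dedicated master host. A hedged sketch of the peer side (hostname, port, and key are made up for illustration):

```ini
# server.conf on each peer node (Splunk 5.x)
[replication_port://9887]
# port peers use to stream replicated bucket data to each other

[clustering]
mode = slave
# the dedicated master machine, NOT this peer
master_uri = https://spacedock-master.fleet.example:8089
pass4SymmKey = fleet-secret
```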
But what happens if the peer and the master are not on same machine, but the entire site goes down? As in, hurricane flooding takes out the entire datacenter?
Bit of a problem there, but as Vishal suggests, you can start up a new master elsewhere, which will then lead to full recovery.
I'll try to summarize your question to make sure I have it right: if my master and 2 peers fail in a cluster with replication factor = 3, what will happen?
In this case, the cluster won't be able to take corrective action to recover until a master (the original or a new one) is brought back into the cluster. Although there are plans to make the master redundant in a future release, in 5.0 there is no notion of multiple masters. However, one nice property of the 5.0 master is that it persists no data. If your master completely blows up, you just have to stand up a separate machine configured with the exact same clustering stanza in server.conf; as long as the master_uri from the peers'/search heads' point of view doesn't change, the new master will be able to reconstruct its state once all the peers have registered themselves against it. This fact can be used to set up a fail-over master node with DNS/virtual-IP tricks. That's not first-class support for master redundancy, of course, but it may be a suitable workaround for some folks.
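That DNS trick can be sketched like so: peers and search heads reference the master only through a stable alias, so a cold-standby machine with an identical stanza can take over the name. Hostnames and the key below are illustrative, not real:

```ini
# server.conf on every peer: reference the master via a
# DNS alias rather than a physical hostname
[clustering]
mode = slave
master_uri = https://cluster-master.fleet.example:8089
pass4SymmKey = fleet-secret

# (search heads use the same master_uri with mode = searchhead)

# server.conf on the standby master: the same [clustering]
# stanza as the original master
[clustering]
mode = master
replication_factor = 3
search_factor = 2
pass4SymmKey = fleet-secret
```

If the original master blows up, you bring the standby online and repoint the cluster-master.fleet.example alias at it; since master_uri never changes from the peers' point of view, they re-register and the new master rebuilds its state.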