Let's assume you have a multi-site indexer cluster with 2 sites, 3 indexers each, and the following RF/SF:

site_replication_factor = origin:2, total:4
site_search_factor = origin:2, total:4

So for each indexer receiving data, there is one site-local replication copy and two remote-site copies.

When one indexer becomes unavailable, replication switches to the 2 remaining indexers on that site and everything keeps working as before. But what happens if 2 indexers on one site are unavailable, or an entire site fails (the site without the master node)?

As far as I understand, events should still be received by the remaining indexers (thanks to indexer discovery and the forwarders' load balancing). The replication factor is no longer met (because 2 copies on the failed site are not possible at that time), but local indexing and local site replication continue without interruption and without manual steps (as long as the master is still available and is not restarted). Searching should also remain fully possible, as there is at least one searchable copy of each bucket on the remaining indexers. The master node knows that there are buckets with pending replication tasks (RF/SF not met because not enough target indexers are available), but everything keeps working, and when the indexers/site come back, this is fixed automatically; after some time the replication factor and search factor are met again. This is called "disaster recovery" in the Splunk documentation and is what anyone would expect from "high availability", imho.

Is this explanation and my understanding correct in theory, and is this what one can expect? Or are there doubts, or am I wrong in some details? Are there any real-world practical experiences confirming that this works fully, or are there problems/errors or exceptions in some cases?
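For reference, this is roughly the clustering stanza I have in mind on the master node; a minimal server.conf sketch of the scenario above, with the site names and pass4SymmKey as placeholders:

# server.conf on the master node (cluster manager), sketch only
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:2,total:4
site_search_factor = origin:2,total:4
pass4SymmKey = <your_secret>

My assumption is that during the failure and after the site returns, the pending fixup tasks and RF/SF status can be followed on the master, e.g. with "splunk show cluster-status" or in the UI under Settings > Indexer Clustering.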