Here are my responses to your questions and comments. This is definitely a very helpful discussion!

Q: What is the recovery time objective for the standby site? IOW, for how long can you be without Splunk?
A: The RTO for DR is 1 hour, so Splunk would need to be up and running again within an hour to meet our availability SLA.

Q: Are these servers physical or virtual? If virtual, you may be able to use features of the VM provider (VMware, for instance) to handle server recovery for you.
A: These are all Oracle Cloud VMs running RHEL 7.x. You raise a good point about server recovery, so I'll check with the right folks about that.

Q: Using rsync to keep Splunk instances in sync is an incomplete solution at best. This is partly why Splunk developed clustering (although clustering is intended as a scaling solution, not a DR one). Anything that uses KVStore (which includes ES) will not be replicated by rsync.
A: Search head clustering was recommended, but we're low on available CPU, so we're holding off on SH clustering for now. As we begin ingesting more data and authorizing more users to access Splunk, we'll be in a better position to get the necessary resources for a SHC. We just need 1 more ES SH and 1 more ad-hoc SH for 2 SH clusters.

Q: Splunk configurations should be managed in an external CM solution such as git. When it's time to start the standby system(s), load the latest config from CM first.
A: That makes sense.

Q: Since you will have an indexer cluster spanning sites, consider having a search cluster that also spans those sites. A cluster will keep itself in sync and spread the search load across all members of the cluster, making better use of limited resources. Managing a cluster is different from managing a single SH, but (IMO) it's easier to manage a SHC than it is to manage 4 independent SHs.
A: That makes sense. Once we have the resources to get a SHC built out, it will cover both sites.

Q: To make cut-over easier, use DNS to route traffic to the DS, CM, and LM (license master, which was not mentioned previously, but is necessary). When the primary site goes down, DNS will redirect requests to the other site without having to change configurations in indexers and forwarders.
A: That's a great point.
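
For anyone who wants to sanity-check the DNS approach before and after a cut-over, here is a minimal sketch (not something from this thread's environment). It resolves each alias and probes the management port on whatever address DNS currently returns. The hostnames are hypothetical placeholders, and port 8089 is assumed because it is Splunk's default management port; adjust both for your deployment.

```python
import socket
from typing import Callable, Dict, List

# Hypothetical DNS aliases for the DS, CM, and LM; replace with your own names.
SERVICE_ALIASES = {
    "deployment_server": "splunk-ds.example.com",
    "cluster_manager": "splunk-cm.example.com",
    "license_master": "splunk-lm.example.com",
}
MGMT_PORT = 8089  # Splunk's default management port (assumed unchanged)


def resolve(hostname: str) -> List[str]:
    """Return the sorted, unique IP addresses the alias currently resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def port_open(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_aliases(
    aliases: Dict[str, str],
    resolver: Callable[[str], List[str]] = resolve,
    probe: Callable[[str, int], bool] = port_open,
    port: int = MGMT_PORT,
) -> Dict[str, Dict[str, bool]]:
    """Map each role to {ip: reachable}; an empty dict means the alias did not resolve."""
    report: Dict[str, Dict[str, bool]] = {}
    for role, hostname in aliases.items():
        try:
            ips = resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            ips = []
        report[role] = {ip: probe(ip, port) for ip in ips}
    return report
```

Calling `check_aliases(SERVICE_ALIASES)` from a forwarder or indexer host shows which site each alias points to and whether the service there answers. The resolver and probe are passed in as parameters so the check can be exercised without touching the network.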