Hi all,
We are struggling to get our Splunk architecture proposal approved by our internal review process and one of the questions they keep raising is about our data resiliency plan.
This is our proposal on a very high level:
Our review board is asking why we don't keep 4 copies instead of 6, i.e. 2, 1, 1 (origin 2, 4 in total).
Reasons to keep 6 copies of our data across 3 sites and 9 indexers.
PROS:
CONS:
* Cost (6 copies need 50% more storage than 4; put another way, 4 copies would need 33% less) and extra network bandwidth
Are there any other benefits of having 6 copies of our data instead of 4 that I can use to justify the extra cost?
Performance maybe? Anything else?
Thanks,
J
@javiergn The advantage of having two copies at each site, as you noted, is that you can have 2 remote sites completely down AND sustain a single local indexer failure. However, you have to ask yourself: what is the likelihood of losing two remote datacenters AND a local indexer?
With 2 copies at the originating site and 1 copy at each of the two remote sites (origin 2, 4 in total), you could sustain an indexer failure at any one of your sites and still maintain searchability. The difference is in the fix-up: if an indexer fails at the site that originated the bucket, Splunk can re-replicate the data (bucket fix-up) within that site, whereas at a remote site holding a single copy we'd have to reach across the WAN to re-replicate it.
You're not going to get any performance benefits from having multiple copies at each site, because only a single copy of each bucket is primary (searchable) at each site, regardless of how many copies there are. The only potential gain is that bucket fix-up will happen faster with 2 copies per site, because the data can be re-replicated locally; for sites that only have 1 copy of each bucket, fix-up has to pull the data across the WAN.
Make sense?
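To make the trade-off concrete, here is a rough Python sketch of the worst-case reasoning above. It is not Splunk code: the function names and site labels are made up for illustration, it only counts copies of a single bucket (it does not distinguish searchable from non-searchable copies), and it assumes every failed indexer happens to hold a copy of that bucket.

```python
# Worst-case copy counting for one bucket in a 3-site cluster.
# All names here are hypothetical; this is a model, not Splunk's logic.

def surviving_copies(copies_per_site, sites_down=(), failed_indexers=None):
    """Return the copies left at each site, assuming each failed
    indexer was holding one copy of the bucket (worst case)."""
    failed_indexers = failed_indexers or {}
    remaining = {}
    for site, copies in copies_per_site.items():
        if site in sites_down:
            remaining[site] = 0
        else:
            remaining[site] = max(0, copies - failed_indexers.get(site, 0))
    return remaining


def is_available(copies_per_site, sites_down=(), failed_indexers=None):
    """True if at least one copy of the bucket survives anywhere."""
    left = surviving_copies(copies_per_site, sites_down, failed_indexers)
    return sum(left.values()) > 0


# 6 copies: 2 at every site.
SIX_COPIES = {"site1": 2, "site2": 2, "site3": 2}
# 4 copies (origin 2, total 4) for a bucket that originated at site2,
# seen from site1, which therefore holds only 1 copy.
FOUR_COPIES_REMOTE_ORIGIN = {"site1": 1, "site2": 2, "site3": 1}

# Scenario from the post: both remote sites down plus one local indexer failure.
scenario = {"sites_down": ("site2", "site3"), "failed_indexers": {"site1": 1}}

print(is_available(SIX_COPIES, **scenario))                 # True
print(is_available(FOUR_COPIES_REMOTE_ORIGIN, **scenario))  # False

# A single indexer failure at any one site is survivable with 4 copies.
print(all(
    is_available(FOUR_COPIES_REMOTE_ORIGIN, failed_indexers={site: 1})
    for site in FOUR_COPIES_REMOTE_ORIGIN
))  # True
```

So in this simplified model, the only scenario where 6 copies wins outright is the double-site outage combined with a local indexer failure.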
Hi, thanks for your quick response.
I guess I'm also concerned about the maintenance implications.
Each search head will only search locally because of the multisite clustering. Therefore, if we keep only a single copy of our data per site (spread across the 3 local indexers) and I want to patch an indexer, or there's an unexpected failure, that will effectively invalidate the whole site, because the two remaining indexers only hold about two-thirds (66%) of our data on average.
Therefore any search still running there will get compromised as the outcome won't be reliable anymore. Same for scheduled searches.
Is that a valid assumption?
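For reference, the "66%" figure can be reproduced with a small Python sketch. This is only back-of-the-envelope arithmetic under an assumption I'm making explicit: each bucket's local copies are placed on distinct indexers chosen uniformly at random, which is a simplification and not a statement about how Splunk actually places copies. The function name is hypothetical.

```python
from math import comb


def fraction_locally_available(indexers, copies, failed):
    """Expected fraction of buckets that still have at least one local copy
    after `failed` of the site's `indexers` go down, assuming each bucket's
    `copies` sit on distinct, uniformly chosen indexers (an assumption)."""
    if not 0 <= failed <= indexers or not 1 <= copies <= indexers:
        raise ValueError("need 0 <= failed <= indexers and 1 <= copies <= indexers")
    # A bucket is lost locally only if all of its copies are on failed indexers.
    lost = comb(failed, copies) / comb(indexers, copies)
    return 1 - lost


# 3 indexers per site, 1 indexer down for patching:
print(round(fraction_locally_available(3, copies=1, failed=1), 3))  # 0.667
print(round(fraction_locally_available(3, copies=2, failed=1), 3))  # 1.0
```

With 1 copy per site, one indexer down leaves about two-thirds of the buckets available locally; with 2 copies per site, every bucket still has a local copy.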
If you're planning on deploying multi-site index clustering with Search Head affinity (where a SH only uses the local indexers for searches) and you have a local failure or bring down a local indexer, the Search Head will automatically reach out to another site to fulfill search requests if there are no searchable buckets locally. So even for sites with a RF:1, SF:1, you can still fulfill search results by reaching out to another site until we can fix-up the local buckets.
In 6.3, we also introduced the ability to turn off Search Head affinity so that all indexers across all sites participate in searches. This obviously requires that you have decent bandwidth and low latency between sites.