I am preparing to migrate my Splunk data storage to AWS S3 using SmartStore. My S3 buckets will be replicated across AWS regions for failover, and I have a requirement to fully test that capability. In theory, all of the data in my cluster should replicate to my failover region, and I should be able to simply point my new Splunk instances at it and go. Has anyone tested this, and do you have any insight into what will go wrong?
I'm not sure I fully understand your use case.
You have an indexer cluster which is using SmartStore (s2) storage. For the purpose of simplicity, let's assume this deployment is in a regional geography (A) and you have located your AWS S3 bucket in the same region. (Maybe your indexers are EC2 instances, maybe not.)
You want to replicate the s2 data into AWS region B.
What I am unclear on is under what conditions, and from which locations, you would try to access the s2 data in region B.
Whilst it is true that AWS can have disruptions, Amazon S3 is designed for eleven nines (99.999999999%) of data durability. That is about as much "data durability" as is ever going to be practical. To put it another way, adding a replica of your s2 data will have almost no benefit to the safety of your data (from a loss perspective).
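To put a number on that, here is a back-of-the-envelope sketch in Python. It assumes the eleven-nines figure can be read as an annual, per-object durability and that object losses are independent. The function names are purely illustrative.

```python
from fractions import Fraction

# Assumption: eleven nines is read as an annual, per-object durability figure.
# Fraction keeps the arithmetic exact (floats cannot represent this cleanly).
ELEVEN_NINES = Fraction("0.99999999999")


def expected_annual_loss(num_objects: int, durability: Fraction = ELEVEN_NINES) -> Fraction:
    """Expected number of objects lost per year, assuming independent losses."""
    return num_objects * (1 - durability)


def years_per_lost_object(num_objects: int, durability: Fraction = ELEVEN_NINES) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(num_objects, durability)


# Example: ten million stored objects.
print(years_per_lost_object(10_000_000))  # prints 10000
```

With ten million objects, that works out to roughly one lost object every 10,000 years on average. Note that this is about durability only; it says nothing about whether the data is reachable during a regional outage.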
If you are trying to architect around the possibility that an AWS region could fail (and be unreachable for a significant period), and you intend to point your region A IDX cluster at the s2 data in region B: DON'T DO THIS.
Not only will your access be slow, but AWS S3 replication is asynchronous and only "eventually consistent". This means it is very probable that your region B data will be behind the last version of the data that existed in region A before the fault developed.
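If you do end up testing this, one way to see the lag is to compare an object listing from each region at the moment of the simulated failure. The sketch below is a minimal illustration in plain Python. It assumes you have already collected each listing as a mapping of object key to some version marker (an ETag, for example); the keys, markers, and function name are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class ReplicationGap(NamedTuple):
    missing: List[str]  # keys present in the source but absent from the replica
    stale: List[str]    # keys present in both, but with different version markers


def replication_gap(source: Dict[str, str], replica: Dict[str, str]) -> ReplicationGap:
    """Report which source objects the replica does not yet hold in current form."""
    missing = sorted(k for k in source if k not in replica)
    stale = sorted(k for k in source if k in replica and source[k] != replica[k])
    return ReplicationGap(missing, stale)


# Hypothetical listings (object key -> version marker) taken at failure time.
region_a = {"idx/main/bucket1": "v2", "idx/main/bucket2": "v1", "idx/main/bucket3": "v1"}
region_b = {"idx/main/bucket1": "v1", "idx/main/bucket2": "v1"}

gap = replication_gap(region_a, region_b)
print(gap.missing)  # objects that never arrived in region B
print(gap.stale)    # objects that arrived, but in an older version
```

Anything reported in either list is data that a cluster pointed at region B would not see, or would see in an outdated state.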
Also, replication is not bidirectional. Were you to attach the remote region for s2, new data indexed in region B would be stuck there. It will not replicate back.
In addition, when the fault clears in region A, the missing data from A will be written to B, and who knows what untold mess this will bring. Finally, in a situation where AWS S3 is unavailable in your "home" region, it's quite probable that you won't be able to access the data in AWS S3 region B from the same source. (AWS uses all sorts of mechanisms to route your access via its infrastructure, so there is a high possibility that a regional AWS failure would impact all access to all AWS S3 data from that geography.)
If your primary driver is to be able to tolerate a failure in the s2 storage tier, then architect that tolerance within Splunk.
Deploy a multisite cluster:
Site 1: idx in region A, s2 storage in AWS region A
Site 2: idx in region B, s2 storage in AWS region B
That way, if you encounter a fault in AWS region A, your peers in region B can still provide search results. It also means that when the fault clears, any in-flight (but committed) data in your indexers will be cleanly written to SmartStore in the affected region.
A good rule of thumb with SmartStore - don't mess with the workings of SmartStore.
Rolling your own storage replication would be ill-advised (at least without Splunk PS oversight).
Just a note: I have used potentially confusing terminology in the above.
I use s2 (lowercase s) to refer to Splunk SmartStore.
I use S3 (capital S) to refer to AWS Simple Storage Service.
I hope that makes sense.
Thanks for the insight. I had a feeling the real issue would be something having to do with when AWS region A turns back on.
To clarify the why behind this: I have a business requirement to reduce the cost of Splunk, and SmartStore seems like an obvious choice for an on-prem solution given that most of my costs are disk related. I also work in a heavily regulated industry, and we have a requirement to simulate a regional AWS failure for a significant period of time regardless of AWS's actual uptime.
Given the cost considerations, I don't think I'll be going with a multi-region cluster. However, given the eventually consistent replication of S3, I am going to need to rethink this approach.