Best Practice Setup for a Disaster Recovery Site:
Suppose the primary site goes down. The DR site is set up with similar hardware in VMs, plus a Splunk server. If the primary is down, I need the DR site to be up and functional. How does the license work?
- Can I load the license into the Splunk DR site and receive data from forwarders?
- If it is loaded, can I view old data, provided that I copied the index databases over to the DR site?
Or does anyone have suggestions on setting up a DR site?
In our primary site I have 4 physical Splunk servers:
First server has 2 Splunk installations: one as a master/license instance, one as a deployment server instance.
Second server has 1 Splunk instance as a heavy forwarder.
Third server has 1 Splunk instance as an indexer.
Fourth server has 1 Splunk instance as a search head.
In our Disaster site I have 2 physical Splunk servers:
First server has 1 Splunk instance as an indexer.
Both Indexers (1 in primary site and 1 in disaster site) are in a normal cluster (no multisite because I only have 2 indexers and not 3).
Configured with replication factor 2 and search factor 2.
All my indexed data will be available and searchable in our primary site and in our disaster site.
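For reference, a two-peer single-site cluster like this is driven by a handful of server.conf settings. A minimal sketch (the hostname and pass4SymmKey are placeholders, not from the original setup):

```ini
# server.conf on the master (cluster manager) node
[clustering]
mode = master
replication_factor = 2
search_factor = 2
pass4SymmKey = changeme

# server.conf on each of the two peer indexers
[replication_port://9887]

[clustering]
mode = slave
master_uri = https://cluster-master.example.com:8089
pass4SymmKey = changeme
```

With only 2 peers and replication factor 2 / search factor 2, every bucket ends up searchable on both peers, which is exactly what makes the DR indexer immediately usable.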
The second server in our disaster site is our test/acceptance/development AND disaster server.
It contains 11 Splunk instances in total!
1 development Splunk instance with free license. Just to play with data.
6 copies of our production Splunk instances (1 master/license, 1 deployment server, 2 indexers, 1 heavy forwarder, 1 search head), just to simulate our production environment and to test new releases, clustering and all the configurations. All combined in 1 server!
Each of these Splunk instances has its own 500 MB license.
All using different port numbers, of course!
4 Splunk instances as disaster recovery instances: 1 master/license, 1 deployment server, 1 heavy forwarder and 1 search head.
We are synchronizing all the configuration data (excluding var/log/splunk/, var/log/introspection/, var/lib/splunk/ and var/run/splunk/) from our 4 production Splunk instances to our disaster Splunk instances.
No need to replicate the indexed data, because that's already taken care of by our production cluster settings.
We are using a stretched VLAN.
In case of 1 server down in our production site we are starting the corresponding Splunk instances in our disaster site on our disaster server.
In case of production site down, we start the 4 disaster production Splunk instances on our disaster server.
They are using the same ip-addresses and port numbers as our production servers.
Now we are using our original production license.
Now we have 1 indexer (with replicated and searchable data) left in our disaster site.
In case of a disaster we are able to switch to our disaster site within, say, 10 minutes, with all indexed data available and searchable.
Of course you can shut down the sandbox environment and all test/development Splunk instances if performance suffers.
Yes, you can use the same license for the primary and DR site. Essentially, if it is the same data, you turn on multisite clustering, configure your sites, and have Splunk copy the data between sites under the same license.
No need to forward the data to two different locations, and no need to have 2x licenses. If you use multisite clustering, you forward the data to only one location and Splunk replication takes care of replicating the data to the DR site.
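For the record, multisite clustering is configured in server.conf roughly like this (site names, the hostname, the key and the factors are illustrative; check the docs for your version):

```ini
# server.conf on the master node
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
pass4SymmKey = changeme

# server.conf on a peer indexer in the DR site
[general]
site = site2

[clustering]
mode = slave
master_uri = https://cluster-master.example.com:8089
pass4SymmKey = changeme
```

With total:2 spread across two sites, each site holds a searchable copy of every bucket, so the DR site can keep searching locally if the primary is lost.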
May I send data to two locations using auto load balancing? For availability, if one of the sites is down, the other site can receive data - I think that is reasonable. But we saw problems with hot buckets: when one of the sites is down, buckets roll to warm very fast, according to this link: http://docs.splunk.com/Documentation/Splunk/6.2.2/Indexer/Bucketreplicationissues.
So my next question is: what is the maximum value for max_replication_errors, and what is its influence on performance?
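Auto load balancing across both sites would look roughly like this in the forwarder's outputs.conf (hostnames are placeholders). Note this balances events across the listed indexers; it does not duplicate them:

```ini
# outputs.conf on a forwarder: one output group listing indexers in both
# sites. The forwarder load-balances across the servers and keeps sending
# to the remaining ones if a site becomes unreachable.
[tcpout]
defaultGroup = all_indexers

[tcpout:all_indexers]
server = idx1.site1.example.com:9997, idx1.site2.example.com:9997
useACK = true
```

useACK makes the forwarder re-send events that were never acknowledged, which helps with the mid-failover data-loss concern (at the cost of some throughput).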
To forward data to the secondary site when the primary site is down, there is a new feature in 6.6 that can handle this case automatically.
Disaster recovery is only as good as the money you throw at it. A simple DR might include setting up a lot of indexers (half at the main site, half at the DR site) and two separate search heads (one at each site) that search all of the indexers. You can then use Splunk replication and keep enough copies everywhere to withstand half of the nodes going down. That being said - you probably want to see what is best for your environment.
Here is something about multi-site clustering: http://docs.splunk.com/Documentation/Splunk/6.1.3/Indexer/Multisiteclusters
To make it work in a DR situation - you would configure forwarders to send to two different locations. You would then also need a 2x license to handle it. Expensive. Or just rsync the indexes every few minutes. Again - very dependent on how much Ca$h you have, and how well you handle network latency and other issues.
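Sending to two different locations (data cloning, which is what doubles the license need) is done by naming two target groups in the forwarder's outputs.conf; hostnames here are placeholders:

```ini
# outputs.conf on a forwarder: listing two groups in defaultGroup makes the
# forwarder clone every event to both groups, so each site indexes a full
# copy - hence the 2x license usage mentioned above.
[tcpout]
defaultGroup = primary_site, dr_site

[tcpout:primary_site]
server = idx1.primary.example.com:9997

[tcpout:dr_site]
server = idx1.dr.example.com:9997
```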
Or you could even just rsync indexes to a complete mirror of your production, and put a search head in front of it. But you have to have the bandwidth to support it.