Hi all,
I'm testing multisite indexer clustering with the configuration below and found undesired behaviour in the case of a site failure.
available_sites = site1,site2
site_replication_factor = origin:2,site1:2,site2:2,total:4
site_search_factor = origin:1,site1:1,site2:1,total:2
As you can see, I have configured the replication factor as "origin:2,site1:2,site2:2,total:4" so that I have 2 replicas in each site. But in the case of a site failure, I am observing that Splunk tries to replicate locally in the site that is still up in order to satisfy the 'total:4' condition. I think this can be a problem when the available disk space on the machines is low.
Let's say the site2 indexer machines are at 80% disk usage and site1 fails. When Splunk then tries to create 4 replicas in the same site (site2) because of the site failure, it can easily exhaust the disks.
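To put a rough number on the disk concern, here is a minimal Python sketch. It is back-of-the-envelope arithmetic only, not anything Splunk computes: `projected_disk_usage` is a hypothetical helper, and it assumes all used space is bucket data and every copy is the same size (in practice non-searchable copies are smaller, so the real figure would likely be lower).

```python
def projected_disk_usage(current_pct, copies_now, copies_after):
    """Scale current disk usage by the growth in copies held on one site.

    Assumption: all used space is bucket data and every copy has the same size.
    """
    return current_pct * copies_after / copies_now


# site2 at 80% while holding 2 copies per bucket; 4 copies if the
# whole 'total:4' were satisfied on site2 alone.
usage = projected_disk_usage(80.0, copies_now=2, copies_after=4)
print(f"projected usage: {usage:.0f}%")  # prints "projected usage: 160%"
```

Anything above 100% means the surviving site could not hold the extra copies without freeing space first.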
According to an update from Splunk support, this is the default behaviour, but I feel there needs to be additional control over it. Any advice or suggestions on this issue would be really helpful. Thank you.
We have the exact same issue: 2 sites with
site_replication_factor = origin:2,total:4
When a site is down (for example for disaster recovery test purposes or due to a datacenter/region outage), the other site starts replicating everything to match total:4. Splunk even moves data to frozen until this replication factor is met again. So you can lose data due to this behavior ☹️
Is there a way to tell Splunk that the maximum replication factor per site must be 2, not 4?
@flotridai I think you just need to explicitly specify each site's replication factor, as below.
site_replication_factor = origin:2,site1:2,site2:2,total:4
In my case, I had misread the index status as Splunk trying to achieve RF 4 on the local site. In fact, it was only showing that half the replicas were missing, and no replication was in progress.
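The reading described above (explicit per-site values, with the copies for a down site simply reported as missing) can be sketched in Python. This only models the interpretation given in this reply; it is not Splunk's actual placement logic, and `parse_site_factor` and `missing_copies` are hypothetical helpers, not Splunk APIs.

```python
def parse_site_factor(value):
    """Parse 'origin:2,site1:2,total:4' into {'origin': 2, 'site1': 2, 'total': 4}."""
    result = {}
    for part in value.split(","):
        name, count = part.strip().split(":")
        result[name.strip()] = int(count)
    return result


def missing_copies(value, down_sites):
    """Return the copies assigned to explicitly named sites that are down.

    Only sites named explicitly in the value are considered; the 'origin'
    and 'total' keys are skipped.
    """
    factor = parse_site_factor(value)
    return {
        site: count
        for site, count in factor.items()
        if site not in ("origin", "total") and site in down_sites
    }


srf = "origin:2,site1:2,site2:2,total:4"
print(parse_site_factor(srf))                     # {'origin': 2, 'site1': 2, 'site2': 2, 'total': 4}
print(missing_copies(srf, down_sites={"site1"}))  # {'site1': 2}
```

With site1 down, the sketch reports 2 of the 4 copies as missing, which matches the "half the replicas are missing" status mentioned above.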
Hi
Based on https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Sitereplicationfactor it shouldn't work like this. You could try adding site1:2,site2:2 to site_replication_factor; based on the docs, it should then work. When the other site is down, it should keep 2 copies of each bucket on the current/origin site and report that the SRF cannot be met. If it does something else, you should report a bug to Splunk support.
As the docs are not crystal clear about this, you could also ask them to clarify this situation in the docs.
r. Ismo
Update: The failures we usually expect are network failures, where the failed site will be back in a few hours. In the meantime, we might not want 4 replicas in the site that is still up. And if we do need additional replicas, we want to create them manually.