Multisite replication factor issues

jpillai · ‎11-16-2021

Hi all,

I'm testing multisite indexer clustering with the configuration below and found undesired behaviour in the case of a site failure.

available_sites = site1,site2
site_replication_factor = origin:2,site1:2,site2:2,total:4
site_search_factor = origin:1,site1:1,site2:1,total:2

As you can see, I have configured the replication factor as "origin:2,site1:2,site2:2,total:4" so that each site holds 2 replicas. However, when a site fails, I observe that Splunk tries to replicate locally within the site that is still up in order to satisfy the 'total:4' condition. I think this can be a problem when the indexers have little free disk space.

Let's say the site2 indexers are at 80% disk usage and site1 fails. When Splunk then tries to create 4 replicas in the same site (site2) because of the site failure, it can easily exhaust the disks.

According to Splunk support, this is the default behaviour, but I feel there needs to be additional control over it. Any advice or suggestions on this issue would be really helpful. Thank you.
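As a rough illustration of that concern, here is a minimal back-of-the-envelope sketch. The function name and parameters are hypothetical, and it assumes that all disk usage is bucket data and that every additional copy is as large as the existing ones (non-searchable copies are typically smaller, so this is a worst case):

```python
def projected_disk_usage(current_usage_pct: float,
                         current_copies: int,
                         target_copies: int) -> float:
    """Estimate disk usage (in percent) if the number of bucket copies
    held by a site grows from current_copies to target_copies.

    Assumption: usage scales linearly with the number of copies.
    """
    if current_copies <= 0:
        raise ValueError("current_copies must be positive")
    return current_usage_pct * target_copies / current_copies


# site2 is at 80% while holding 2 copies; holding all 4 would need:
usage = projected_disk_usage(80.0, current_copies=2, target_copies=4)
print(f"Projected usage: {usage:.0f}%")  # Projected usage: 160%
```

Under these assumptions, anything above 50% usage on the surviving site would not leave enough room for all 4 copies.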

flotridai · ‎07-23-2023

We have the exact same issue: 2 sites with

site_replication_factor = origin:2,total:4

When a site is down (for example during a disaster recovery test or because of a datacenter/region outage), the other site starts replicating everything to reach total:4. Splunk even moves data to frozen in order to satisfy the replication factor again, so you can lose data due to this behavior ☹️

Is there a way to tell Splunk that the maximum replication factor per site must be 2, not 4?

jpillai · ‎07-23-2023

@flotridai I think you just need to explicitly specify each site's replication factor, as below.

site_replication_factor = origin:2,site1:2,site2:2,total:4

In my case, I had misread the index status as Splunk trying to achieve RF 4 on the local site. In fact, it was only showing that half the replicas were missing, and no replication was in progress.

isoutamo · ‎07-23-2023

Hi

Based on https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Sitereplicationfactor it shouldn't work like this. You could try adding site1:2,site2:2 to site_replication_factor; according to the docs, it should then work. When the other site is down, it should store 2 copies of each bucket on the current/origin site and report that the SRF cannot be met. If it does something else, you should report a bug to Splunk support.

As the docs are not crystal clear about this, you could also ask them to clarify this situation in the docs.

r. Ismo

jpillai · ‎11-16-2021

Update: The failures we usually expect are network failures, where the failed site is back within a few hours. In the meantime, we might not want 4 replicas in the site that is still up. And if we do need additional replicas, we want to create them manually.
