topic Re: Multisite replication factor issues in Deployment Architecture

Multisite replication factor issues

jpillai — Tue, 16 Nov 2021 10:29:19 GMT

Hi all,

I'm testing multisite indexer clustering with the configuration below and found undesired behaviour in the case of a site failure.

available_sites = site1,site2
site_replication_factor = origin:2,site1:2,site2:2,total:4
site_search_factor = origin:1,site1:1,site2:1,total:2

As you can see, I have set the replication factor to "origin:2,site1:2,site2:2,total:4" so that each site holds 2 replicas. However, when a site fails, I observe that Splunk tries to replicate locally within the surviving site to satisfy the 'total:4' condition. I think this can be a problem when the available disk space on the machines is limited.

Let's say the site2 indexers are at 80% disk usage and site1 fails. When Splunk then tries to create 4 replicas in the same site (site2) because of the site failure, it can easily exhaust the disks.
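The disk-space concern can be made concrete with a rough calculation. This is only a sketch under simplifying assumptions that are not from the post: all used disk space is bucket data, every copy of a bucket is the same size, and the function name `projected_disk_usage_pct` is hypothetical. Real copies can differ in size depending on whether they are searchable, so treat the result as an upper-bound estimate.

```python
def projected_disk_usage_pct(current_pct: float, copies_now: int, copies_after: int) -> float:
    """Scale current disk usage by the growth in copies held on the surviving site.

    Assumption (illustrative only): disk usage is proportional to the number
    of bucket copies the site holds.
    """
    if copies_now <= 0:
        raise ValueError("copies_now must be positive")
    return current_pct * copies_after / copies_now


# Scenario from the post: site2 is at 80% and would go from 2 to 4 copies per bucket.
usage = projected_disk_usage_pct(80.0, copies_now=2, copies_after=4)
print(f"projected usage: {usage:.0f}%")  # prints "projected usage: 160%"
```

Under these assumptions the projected usage is well above 100%, which is why local re-replication would exhaust the disks.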

According to Splunk support, this is the default behaviour, but I feel there needs to be additional control over it. Any advice or suggestions on this issue would be really helpful. Thank you.

Re: Multisite replication factor issues

jpillai — Tue, 16 Nov 2021 10:32:53 GMT

Update: The failures we usually expect are network failures, where the failed site is back within a few hours. In the meantime we might not want 4 replicas in the site that is still up. And if we do need additional replicas, we want to create them manually.

Re: Multisite replication factor issues

flotridai — Sun, 23 Jul 2023 12:10:12 GMT

We have the exact same issue: 2 sites with

site_replication_factor = origin:2,total:4

When a site is down (for example during a disaster recovery test or a datacenter/region outage), the other site starts replicating everything to reach total:4. Splunk even moves data to frozen in order to satisfy the replication factor again, so you can lose data because of this behavior ☹️

Is there a way to tell Splunk that the maximum replication factor per site must be 2, not 4?

Re: Multisite replication factor issues

isoutamo — Sun, 23 Jul 2023 14:22:13 GMT

Based on https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Sitereplicationfactor it shouldn’t work like this. You could try adding site1:2,site2:2 to site_replication_factor; according to the docs it should then work. When the other site is down, it should keep 2 copies of each bucket on the current/origin site and report that the SRF cannot be met. If it does something else, you should report a bug to Splunk support.

As the docs are not crystal clear about this, you could also ask them to clarify this situation in the docs.

r. Ismo

Re: Multisite replication factor issues

jpillai — Mon, 24 Jul 2023 04:05:56 GMT

@flotridai I think you just need to specify each site's replication factor explicitly, as below.

site_replication_factor = origin:2,site1:2,site2:2,total:4

In my case, I had misread the index status as Splunk trying to achieve RF 4 on the local site; in fact it was only showing that half the replicas were missing, and no replication was in progress.
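To illustrate that reading, here is a minimal sketch that parses a site_replication_factor value and counts how many copies are missing when one site is unreachable. It assumes, as described in this thread rather than as confirmed Splunk internals, that the surviving site keeps only its own allotment and creates no extra copies. The helper names are hypothetical, it only models configurations where every site is listed explicitly, and the rule that the origin site gets the larger of `origin` and its explicit value is my understanding of the documentation.

```python
def parse_site_factor(value: str) -> dict:
    """Parse 'origin:2,site1:2,site2:2,total:4' into {'origin': 2, 'site1': 2, ...}."""
    factors = {}
    for part in value.split(","):
        key, _, count = part.strip().partition(":")
        factors[key] = int(count)
    return factors


def copies_per_site(factors: dict, origin_site: str) -> dict:
    """Copies each explicit site holds for a bucket that originated on origin_site.

    Assumption: the origin site gets the larger of 'origin' and its explicit value.
    """
    sites = {k: v for k, v in factors.items() if k not in ("origin", "total")}
    sites[origin_site] = max(sites.get(origin_site, 0), factors.get("origin", 0))
    return sites


def shortfall_when_down(factors: dict, origin_site: str, down_site: str) -> int:
    """Copies missing from 'total' if down_site is unreachable and nothing is re-replicated."""
    sites = copies_per_site(factors, origin_site)
    remaining = sum(n for site, n in sites.items() if site != down_site)
    return factors["total"] - remaining


factors = parse_site_factor("origin:2,site1:2,site2:2,total:4")
missing = shortfall_when_down(factors, origin_site="site2", down_site="site1")
print(f"{missing} of {factors['total']} copies missing")  # prints "2 of 4 copies missing"
```

With site1 down, 2 of the 4 copies are unavailable, which matches the "half the replicas are missing" status rather than an attempt to build 4 copies on site2.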