Multisite replication factor issues

jpillai
Path Finder

Hi all,

I'm testing multisite indexer clustering with the configuration below and found undesired behaviour when a site fails.

available_sites = site1,site2
site_replication_factor = origin:2,site1:2,site2:2,total:4
site_search_factor = origin:1,site1:1,site2:1,total:2
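
For anyone reading along, the policy string can be read as a set of name:count pairs. Here is a minimal Python sketch, assuming the syntax is nothing more than comma-separated `name:count` entries. The helper name `parse_site_policy` is made up for illustration and is not part of Splunk.

```python
def parse_site_policy(policy: str) -> dict[str, int]:
    """Split a 'name:count,name:count' policy string into a dict."""
    result = {}
    for entry in policy.split(","):
        # partition() splits on the first ':' only
        name, _, count = entry.strip().partition(":")
        result[name.strip()] = int(count)
    return result


rf = parse_site_policy("origin:2,site1:2,site2:2,total:4")

# Sum only the explicitly named sites (site1, site2), not origin or total.
explicit_sum = sum(v for k, v in rf.items() if k.startswith("site"))

print(rf)
print(explicit_sum == rf["total"])
```

With this configuration the explicit per-site counts already add up to `total`.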

As you can see, I have configured the replication factor as "origin:2,site1:2,site2:2,total:4" so that I have 2 replicas in each site. But when a site fails, I am observing that Splunk tries to replicate locally in the site that is still up in order to satisfy the 'total:4' condition. I think this can be a problem when the available disk space on the machines is limited.

Let's say the site2 indexer machines are at 80% disk usage and site1 fails. If Splunk then tries to create all 4 replicas in the same site (site2) because of the site failure, it can easily exhaust the disks.
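
To put rough numbers on that concern, here is a back-of-the-envelope sketch in Python. It assumes disk usage is dominated by bucket data and scales linearly with the number of copies a site holds. Both are simplifications (searchable and non-searchable copies differ in size, for one), and the function name is hypothetical.

```python
def projected_usage_pct(current_pct: float, current_copies: int, target_copies: int) -> float:
    """Scale current disk usage by the ratio of target to current copy count.

    Assumes usage grows linearly with the number of bucket copies.
    """
    return current_pct * target_copies / current_copies


# site2 at 80% while holding 2 copies; what if it had to hold all 4?
projected = projected_usage_pct(80.0, current_copies=2, target_copies=4)

print(projected)
print(projected > 100.0)  # would the disks overflow?
```

Under those assumptions the surviving site would need well over its total capacity, which is why I'm worried.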

According to Splunk support, this is the default behaviour, but I feel there needs to be additional control over it. Any advice or suggestions on this issue would be really helpful. Thank you.

flotridai
Engager

We have the exact same issue: two sites with

site_replication_factor = origin:2,total:4

When a site is down (for example, for disaster recovery testing or due to a datacenter/region outage), the other site starts replicating everything to match total:4. Splunk even moves data to frozen in order to satisfy this replication factor again. So you can lose data due to this behavior ☹️

Is there a way to tell Splunk that the maximum replication factor per site must be 2, not 4?

jpillai
Path Finder

@flotridai I think you just need to explicitly specify each site's replication factor, as below.

site_replication_factor = origin:2,site1:2,site2:2,total:4

In my case, I had misread the index status as Splunk trying to reach RF 4 on the local site. In fact, it was only showing that half the replicas were missing, and no replication was in progress.

isoutamo
SplunkTrust

Hi

Based on https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Sitereplicationfactor it shouldn’t work like this. You could try adding site1:2,site2:2 to site_replication_factor; according to the docs, it should then work. When the other site is down, it should keep 2 copies of each bucket on the current/origin site and report that the SRF cannot be met. If it does something else, you should report a bug to Splunk support.

As the docs are not crystal clear about this, you could also ask them to clarify this situation in the docs.

r. Ismo

jpillai
Path Finder

Update: The failures we usually expect are network failures, where the failed site will be back in a few hours. In the meantime, we might not want 4 replicas in the site that is still up. Or, if we do need additional replicas, we want to create them manually.
