Deployment Architecture

Multisite replication factor issues

jpillai
Path Finder

Hi all,

I'm testing multisite indexer clustering with below configuration and found an undesired behaviour in the case of a site failure.

available_sites = site1,site2
site_replication_factor = origin:2,site1:2,site2:2,total:4
site_search_factor = origin:1,site1:1,site2:1,total:2

As you can see I have configured the replication factor "origin:2,site1:2,site2:2,total:4" so that I will have 2 replicas in both sites. But, in the case of a site failure, I am observing that splunk will try to replicate locally in the site that is up and complete the 'total:4' condition. I think this can be a problem when the available disk space on the machines is less.

Let's say site2 indexer machines are at 80% disk space usage and site1 fails - now when splunk tries to create 4 replicas in the same site (site2) due to site failure, it can easily exhaust the disks.

As per update from splunk support, this is default behaviour, but I feel there needs to be additional control over this. Any advise or suggestions around this issue will be really helpful. Thank you.

Labels (1)

flotridai
Engager

We have the exact same issue: 2 Sites with

site_replication_factor = origin:2,total:4

When a site is down (for example for desaster recovery test purposes or due to a datacenter/region outage), the other site starts to replicating everything to match total:4. Splunk even moves data to frozen to get to the point that this replication factor is matched again. So you can lose data due to this behavior ☹️

Is there a possibility to tell splunk that the maximum replication-factor per site must be 2, not 4?

0 Karma

jpillai
Path Finder

@flotridai  I think you just need to explicitly specify each sites replication factor as below.

site_replication_factor = origin:2,site1:2,site2:2,total:4

 In my case, the status of indexes were mistakenly understood as splunk trying to achieve RF 4 on the local site, rather it was just showing that half the replicas are missing and no replication was in progress.

0 Karma

isoutamo
SplunkTrust
SplunkTrust

Hi

based on https://docs.splunk.com/Documentation/Splunk/9.1.0/Indexer/Sitereplicationfactor it shouldn’t work like this. You could try to add site1:2,site2:2 to site_replication_factor. Then based on docs it should work. When another site is down it should store 2 buckets on current/origin site and report that SRF cannot met. If it do something else you should report a bug to splunk support.

As docs are not crystal clear about this you could also ask that they clarify this situation into docs  

r. Ismo

0 Karma

jpillai
Path Finder

Update: The kind of failures we are usually expecting are network failures, where the failed site will be back in few hours. In the mean time we might not want 4 replicas in the same site that is up. Or in case we need additional replicas in any case, we want to do it manually

Get Updates on the Splunk Community!

What's new in Splunk Cloud Platform 9.1.2312?

Hi Splunky people! We are excited to share the newest updates in Splunk Cloud Platform 9.1.2312! Analysts can ...

What’s New in Splunk Security Essentials 3.8.0?

Splunk Security Essentials (SSE) is an app that can amplify the power of your existing Splunk Cloud Platform, ...

Let’s Get You Certified – Vegas-Style at .conf24

Are you ready to level up your Splunk game? Then, let’s get you certified live at .conf24 – our annual user ...