Multi-site indexer clustering: why isn't data sourced from the local site when replicating within the same site?

davidpaper
Contributor

Scenario:
multi-site cluster
site1 and site2
site_replication_factor=origin:2, total:3
site_search_factor=origin:2, total:3

bucket12345 has 2 copies in site1 (origin) and 1 copy in site2.

When a copy of the bucket is deleted in the origin site, site1 (the rb_* copy), the cluster manager (CM) kicks off a job to make a new copy of that bucket. I see it being copied from an indexer in site2 instead of an indexer in site1. I expected Splunk to use a copy in the same site as the source, but it's not doing that.

Why?

davidpaper
Contributor

The logic behind bucket replication sourcing works like this:

1) We will prefer a local site source for RF replication.

2) However, if the local source is already at max capacity for how many replications it can be involved in (max_peer_rep_load), then we can go cross-site for RF replication.

3) For SF replication, there is no preference; the source ends up being random.

On the CM, the max_peer_rep_load setting in the [clustering] stanza of server.conf throttles how many replication jobs run at once. Lowering it will slow down non-streaming (warm/cold) bucket replication, but will not affect streaming (hot) bucket replication. The value represents the number of "slots" each indexer has for taking part in non-streaming replication, either as a source or as a target.
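To make the three rules concrete, here is a minimal Python sketch of the source choice as described above. It is only a model of the explanation, not Splunk's actual code: the function name choose_source, its parameters, and the dict-based bookkeeping are all hypothetical.

```python
import random

def choose_source(holders, target_site, load, max_peer_rep_load, kind="RF"):
    """Pick a source peer for one non-streaming bucket copy (illustrative model).

    holders: dict of peer name -> site, for peers that already hold the bucket
    load:    dict of peer name -> replication slots currently in use
    kind:    "RF" for a replication-factor copy, "SF" for a search-factor copy
    """
    # A peer can only act as a source while it has a free slot.
    free = [p for p in sorted(holders) if load.get(p, 0) < max_peer_rep_load]
    if not free:
        return None  # no holder can serve the copy right now

    if kind == "RF":
        # Rule 1: prefer a source in the same site as the target.
        local = [p for p in free if holders[p] == target_site]
        if local:
            return random.choice(local)

    # Rule 2 (RF with no free local source) or rule 3 (SF, no site preference).
    return random.choice(free)


holders = {"peer2": "site1", "peer4": "site2"}
print(choose_source(holders, "site1", {}, 1))            # peer2 (local source is free)
print(choose_source(holders, "site1", {"peer2": 1}, 1))  # peer4 (local source is busy)
```

The second call shows the behavior from the question: the only local holder has no free slot, so the copy is sourced from the other site.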

Huh, what? Need an example.

Imagine 3 peers on site1, with buckets A and B that we want replicated intra-site (site1 needs 2 copies each of buckets A and B), 1 peer on site2, and max_peer_rep_load=1 (to keep the example simple):

Site1:
Peer1 - Bucket A
Peer2 - Bucket B
Peer3 - Bucket C

Site2:
Peer4 - Bucket B

We may trigger replication of Bucket A from Peer1 to Peer2. Since Peer1 and Peer2 are both involved in that replication, the single "peer rep" slot on each of them is now taken.

Peer3 still has a slot available, so it can receive a copy of Bucket B. Peer2, the only local holder of Bucket B, has no slot free, so the copy comes from another site (Peer4 in site2), which triggers an inter-site copy.

Unfortunately, when we fix buckets, we fix them in some fixed (but effectively random) order, and if the bucket we're scheduling next for replication doesn't have an available source on the local site, it will go to an alternate site.
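The example can be replayed with a small Python sketch of the slot bookkeeping. This is a simplified model under assumptions, not Splunk's scheduler: the schedule function is hypothetical, jobs are treated as running concurrently, targets are fixed to match the example, and the first eligible source is picked instead of a random one so the output is repeatable.

```python
def schedule(jobs, holders, site_of, max_peer_rep_load):
    """Assign a source to each (bucket, target) job, in the order given."""
    used = {peer: 0 for peer in site_of}  # slots in use per peer
    plan = []
    for bucket, target in jobs:
        # Holders of the bucket that still have a free slot.
        free = [p for p in holders[bucket]
                if p != target and used[p] < max_peer_rep_load]
        # Prefer a source in the target's site, else fall back to any site.
        local = [p for p in free if site_of[p] == site_of[target]]
        candidates = local or free
        if used[target] >= max_peer_rep_load or not candidates:
            plan.append((bucket, None, target))  # job has to wait
            continue
        source = candidates[0]
        used[source] += 1  # one slot on the source...
        used[target] += 1  # ...and one on the target
        plan.append((bucket, source, target))
    return plan


site_of = {"Peer1": "site1", "Peer2": "site1", "Peer3": "site1", "Peer4": "site2"}
holders = {"A": ["Peer1"], "B": ["Peer2", "Peer4"]}
jobs = [("A", "Peer2"), ("B", "Peer3")]  # fix-up order from the example

print(schedule(jobs, holders, site_of, 1))
# [('A', 'Peer1', 'Peer2'), ('B', 'Peer4', 'Peer3')]
print(schedule(jobs, holders, site_of, 2))
# [('A', 'Peer1', 'Peer2'), ('B', 'Peer2', 'Peer3')]
```

With one slot per peer, Bucket B is pulled from Peer4 in site2. In this model, giving each peer two slots leaves Peer2 free to serve Bucket B, so the copy stays inside site1.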

A huge thank you to @dxu_splunk for the background to answer the question.

-dave
