We have an 8-node multi-site indexer cluster spanning 2 sites (4 indexers per site)
Site 1 -
SITE1IDX01
SITE1IDX02
SITE1IDX03
SITE1IDX04
Site 2 -
SITE2IDX01
SITE2IDX02
SITE2IDX03
SITE2IDX04
Replication Settings -
site_replication_factor=origin:2,total:3
site_search_factor=origin:1,total:2
Whenever there are connectivity issues or glitches between the two sites, we start to notice the failures below -
WARN BucketReplicator - Failed to replicate Streaming bucket to guid=**** host=***** s2sport=**** Read timed out after 60 sec (between Site 1 & Site 2 indexers)
And these failures have a cascading effect -
Connectivity errors between servers on Site 1 & Site 2 ---> this affects replication of hot streaming buckets between the 2 sites ---> which puts pressure on the Splunk replication queue ---> replication is driven from the indexing pipeline, so the indexing queues fill up as well ---> since the indexing queues are supposed to process incoming ingestion, once they are full the incoming traffic gets dropped.
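For context, this is roughly the search we use to confirm that the index queues are blocking on the surviving peers (standard metrics.log queue fields; adjust to your environment):

index=_internal source=*metrics.log* group=queue name=indexqueue blocked=true
| timechart span=1m count by host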
Need help with these points -
From my research, these look like failure messages for primary bucket reassignment fixups, which, unlike raw and searchable copy fixups, are not avoided by cluster maintenance mode; one possible solution is to use the offline command on a peer (rough commands below). Next, I want to explore ways to avoid them during network glitches or planned network maintenance activities that affect communication between the 2 sites in a multi-site cluster.
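For reference, the commands I was considering are roughly these (run on the peer being taken down; the enforce-counts variant waits until the cluster can meet its replication and search factors before the peer goes down - please double-check against the docs for your version):

# Fast offline - primaries on this peer get reassigned to other peers
splunk offline

# Offline that waits for the cluster to meet replication/search factors first
splunk offline --enforce-counts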
I'm afraid there is no solution other than making your network stable enough to avoid those issues. The whole idea of a multi-site cluster is to ensure that your data is replicated across sites and nodes.
There are some parameters you could try to increase queue sizes and timeouts, but before you try those, you should contact Splunk Support.
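For what it's worth, the knobs I had in mind live in server.conf on the indexers - something along these lines, but treat it as a sketch and verify the exact setting names and defaults in server.conf.spec for your version before touching them:

# server.conf on each indexer (illustrative values only)
[clustering]
# peer-to-peer replication connect/send/receive timeouts, in seconds
rep_cxn_timeout = 60
rep_send_timeout = 60
rep_rcv_timeout = 60

# larger queues only buy time during short glitches, they do not fix the network
[queue=indexQueue]
maxSize = 500MB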
Would having more searchable copies help in this scenario, so as to speed up primary bucket reassignments?
Other than network glitches, there are planned network maintenance activities during which we are supposed to minimize the impact on the multi-site cluster - how do we deal with those?
Maintenance mode has nothing to do with this. Maintenance mode stops fixups on buckets within the cluster.
Do you use useACK=true? If so, I'd try fiddling with ack_factor.
Our ingestion comes in through HEC; no universal or heavy forwarders in place.
My whole concern is to avoid congestion on the queues of the surviving indexers due to 'Streaming bucket failures' during network and other maintenance activities where an indexer restart is needed.
Splunk only provides maintenance mode to minimize the impact, but it is not helping here.
HEC can also be configured to use ACK from what I recall.
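If I remember correctly it's a per-token setting in inputs.conf on the HEC input, something like the sketch below (the token name is just a placeholder):

# inputs.conf (illustrative token stanza)
[http://my_hec_token]
token = <your-token-guid>
useACK = true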
Sure. useACK has its limitations and while useful it has to be used with caution.
I was just pointing out that while the OP is using HEC, ACK can still be enabled somewhere underneath (and it would give the effect described in the opening post, since events are queued until they are properly ack-ed).
It's the same even when we restart our indexers in a rolling manner with maintenance mode enabled.
We immediately notice streaming bucket failures and end up with blocked queues on the other surviving indexers, causing ingestion to drop.
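For completeness, this is roughly the sequence we follow for those rolling restarts (all on the cluster master; exact syntax may differ by version):

# On the cluster master
splunk enable maintenance-mode
splunk rolling-restart cluster-peers
# wait for all peers to rejoin and the cluster to stabilize
splunk disable maintenance-mode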