Deployment Architecture

How do I fix this error: WARN BucketReplicator - Failed to replicate Streaming bucket?

sagaraverma
Loves-to-Learn Everything

We have an 8-node multi-site indexer cluster with 2 sites (4 indexers per site).

Site 1 -
SITE1IDX01
SITE1IDX02
SITE1IDX03
SITE1IDX04

Site 2 -
SITE2IDX01
SITE2IDX02
SITE2IDX03
SITE2IDX04

Replication Settings -
site_replication_factor=origin:2,total:3
site_search_factor=origin:1,total:2
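
To make the cross-site dependency in these settings explicit, here is a minimal sketch that works out how many copies of each bucket have to live outside the originating site. It assumes a two-site cluster and a factor that contains only the origin and total keys, as in the settings above; the function names are hypothetical helpers, not part of Splunk.

```python
def parse_site_factor(spec):
    """Parse a value such as 'origin:2,total:3' into {'origin': 2, 'total': 3}."""
    factor = {}
    for part in spec.split(","):
        key, _, value = part.strip().partition(":")
        factor[key] = int(value)
    return factor


def remote_site_copies(spec):
    """Copies that must be held outside the origin site.

    Assumption: two sites and only 'origin' and 'total' keys, so every copy
    not kept at the origin site has to go to the other site.
    """
    factor = parse_site_factor(spec)
    return factor["total"] - factor["origin"]


replication = "origin:2,total:3"  # site_replication_factor from the post
search = "origin:1,total:2"       # site_search_factor from the post

print(remote_site_copies(replication))  # 1
print(remote_site_copies(search))       # 1
```

With these values, every hot bucket streams one copy to the other site, which is why an inter-site network problem touches all active ingestion.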

Whenever there are connectivity issues or glitches between the two sites, we start to see failures like the following -

WARN BucketReplicator - Failed to replicate Streaming bucket to guid=**** host=***** s2sport=**** Read timed out after 60 sec (between Site 1 & Site 2 indexers)

And these failures have a cascading effect -

Connectivity errors between servers on Site 1 & Site 2 ---> replication of hot streaming buckets between the two sites is affected ---> this puts pressure on the Splunk replication queue ---> replication is part of the index queue, so the indexing queues fill up ---> since the indexing queues are supposed to process incoming ingestion, once they are full the ingestion traffic is dropped.
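
One way to confirm this chain on a given indexer is to look for blocked queues in its metrics.log. The sketch below is only an illustration: the two sample lines are made up to resemble typical group=queue entries, and real lines may carry more or different fields. The function name is hypothetical.

```python
import re

# Illustrative sample only; not taken from the cluster described in the post.
SAMPLE = """\
01-15-2024 10:00:01.000 +0000 INFO  Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=450, current_size=1200, largest_size=1200, smallest_size=0
01-15-2024 10:00:01.000 +0000 INFO  Metrics - group=queue, name=parsingqueue, max_size_kb=6144, current_size_kb=10, current_size=3, largest_size=5, smallest_size=0
"""

# key=value pairs separated by commas or whitespace
KV = re.compile(r"(\w+)=([^,\s]+)")


def blocked_queues(text):
    """Return (queue name, fill ratio) for every queue line marked blocked=true."""
    result = []
    for line in text.splitlines():
        fields = dict(KV.findall(line))
        if fields.get("group") != "queue" or fields.get("blocked") != "true":
            continue
        max_kb = int(fields.get("max_size_kb", 0))
        cur_kb = int(fields.get("current_size_kb", 0))
        fill = round(cur_kb / max_kb, 2) if max_kb else None
        result.append((fields.get("name"), fill))
    return result


print(blocked_queues(SAMPLE))  # [('indexqueue', 0.9)]
```

If the index queue is reported as blocked at the same time as the BucketReplicator warnings, that supports the cascade described above.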

Need help over these points -

  • We assume that such errors will still occur even if the cluster is put into maintenance mode. How, then, do we support network-related maintenance activities for a Splunk multi-site cluster?
  • How can we make the multi-site cluster more resilient, so that network glitches between the indexers on the two sites do not end up blocking the index queues?

sagaraverma
Loves-to-Learn Everything

From my research, these appear to be failure messages for primary bucket reassignment fixups, which, unlike fixups for raw and searchable copies, are not suppressed by cluster maintenance mode. One possible solution is to use the offline command for a peer. The next step is to explore ways to avoid these failures during network glitches or network maintenance activities that affect communication between the two sites in a multi-site cluster.

isoutamo
SplunkTrust

I'm afraid there is no solution other than making your network stable enough to avoid these issues. The whole idea of a multi-site cluster is to ensure that your data is replicated across sites and nodes.
There may be some parameters you could tune to increase queue sizes and timeouts, but before trying those you should contact Splunk Support.

sagaraverma
Loves-to-Learn Everything

Would having more searchable copies help in this scenario, by speeding up primary bucket reassignment?

Apart from network glitches, there are also planned network maintenance activities during which we are expected to minimize the impact on the multi-site cluster. How do we deal with those?

PickleRick
SplunkTrust

Maintenance mode has nothing to do with this. Maintenance mode stops fixups on buckets within the cluster.

Do you use useACK=true? If so, I'd try fiddling with ack_factor.

sagaraverma
Loves-to-Learn Everything

Our ingestion goes through HEC; there are no universal or heavy forwarders in place.

My main concern is to avoid queue congestion on the indexers that remain up, caused by 'Streaming bucket' failures during network maintenance and other maintenance activities that require an indexer restart.

Splunk only provides maintenance mode to minimize the impact, but it is not helping here.

PickleRick
SplunkTrust

HEC can also be configured to use ACK from what I recall.

isoutamo
SplunkTrust
Yes, it can be configured to use it, but in real life you shouldn't 😕 It causes some not-so-nice side effects (or at least it did in older versions).

PickleRick
SplunkTrust

Sure. useACK has its limitations and while useful it has to be used with caution.

I was just pointing out that even though the OP is using HEC, ACK can still be enabled somewhere underneath (and it would produce the effect described in the opening post, since events are queued until they are properly acknowledged).
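
To check whether HEC clients are actually relying on acknowledgment, it can help to know what an ACK-style exchange looks like. The sketch below only builds request descriptions and sends nothing; the host name, token, and port 8088 are placeholder assumptions, and the function names are hypothetical. It relies on two aspects of HEC indexer acknowledgment: clients send a channel identifier in the X-Splunk-Request-Channel header, and they poll the /services/collector/ack endpoint with the ackId values they received.

```python
import json
import uuid


def hec_event_request(host, token, event, channel=None):
    """Describe an event POST; a channel header marks it as an ACK-style request."""
    headers = {"Authorization": f"Splunk {token}"}
    if channel is not None:
        headers["X-Splunk-Request-Channel"] = channel
    return {
        "method": "POST",
        "url": f"https://{host}:8088/services/collector/event",  # port is an assumption
        "headers": headers,
        "body": json.dumps({"event": event}),
    }


def hec_ack_query(host, token, channel, ack_ids):
    """Describe the follow-up request that asks whether given ackIds are indexed."""
    return {
        "method": "POST",
        "url": f"https://{host}:8088/services/collector/ack",
        "headers": {
            "Authorization": f"Splunk {token}",
            "X-Splunk-Request-Channel": channel,
        },
        "body": json.dumps({"acks": list(ack_ids)}),
    }


channel = str(uuid.uuid4())  # one channel per client, chosen by the client
send = hec_event_request("idx.example.com", "PLACEHOLDER-TOKEN", "hello", channel)
poll = hec_ack_query("idx.example.com", "PLACEHOLDER-TOKEN", channel, [0, 1])
print(send["headers"])
print(poll["body"])
```

If none of the clients send a channel header or poll the ack endpoint, HEC acknowledgment is probably not what is holding events back.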

sagaraverma
Loves-to-Learn Everything

The same thing happens when we restart our indexers in a rolling manner, even with maintenance mode enabled.
We immediately see streaming bucket failures, and the queues on the other indexers that are still up become blocked, causing ingestion to be dropped.
