We have an 8-node multi-site indexer cluster spanning 2 sites (4 indexers per site)
Site 1 -
SITE1IDX01
SITE1IDX02
SITE1IDX03
SITE1IDX04
Site 2 -
SITE2IDX01
SITE2IDX02
SITE2IDX03
SITE2IDX04
Replication Settings -
site_replication_factor=origin:2,total:3
site_search_factor=origin:1,total:2
Whenever there are connectivity issues or glitches between the two sites, we start to notice the failures below -
WARN BucketReplicator - Failed to replicate Streaming bucket to guid=**** host=***** s2sport=**** Read timed out after 60 sec (between Site 1 & Site 2 indexers)
And these failures have a cascading effect -
Connectivity errors between servers on Site 1 & Site 2 ---> this affects replication of hot streaming buckets between the 2 sites ---> which puts pressure on the Splunk replication queue ---> replication is driven from the indexing pipeline, so the indexing queues fill up as well ---> since the indexing queues are supposed to process incoming ingestion, once they are full the incoming traffic gets dropped.
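For context, this is roughly the search we use to confirm that the index queues are blocking on the surviving peers (standard metrics.log queue fields; adjust to your environment):

index=_internal source=*metrics.log* group=queue name=indexqueue blocked=true
| timechart span=1m count by host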
Need help with these points -
From my research, these look like failure messages for primary bucket reassignment fixups, which, unlike raw and searchable copy fixups, are not avoided by cluster maintenance mode; one possible solution is to use the offline command on a peer (rough commands below). Next, I want to explore ways to avoid them during network glitches or planned network maintenance activities that affect communication between the 2 sites in a multi-site cluster.
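For reference, the commands I was considering are roughly these (run on the peer being taken down; the enforce-counts variant waits until the cluster can meet its replication and search factors before the peer goes down - please double-check against the docs for your version):

# Fast offline - primaries on this peer get reassigned to other peers
splunk offline

# Offline that waits for the cluster to meet replication/search factors first
splunk offline --enforce-counts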
I'm afraid there is no solution other than making your network stable enough to avoid those issues. The whole idea of a multi-site cluster is to ensure that your data is replicated across sites and nodes.
There are some parameters you could try to increase queue sizes and timeouts, but before you try those, you should contact Splunk Support.
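For what it's worth, the knobs I had in mind live in server.conf on the indexers - something along these lines, but treat it as a sketch and verify the exact setting names and defaults in server.conf.spec for your version before touching them:

# server.conf on each indexer (illustrative values only)
[clustering]
# peer-to-peer replication connect/send/receive timeouts, in seconds
rep_cxn_timeout = 60
rep_send_timeout = 60
rep_rcv_timeout = 60

# larger queues only buy time during short glitches, they do not fix the network
[queue=indexQueue]
maxSize = 500MB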
Would having more searchable copies help in this scenario, so as to speed up primary bucket reassignments?
Other than network glitches, there are planned network maintenance activities during which we are supposed to minimize the impact on the multi-site cluster - how do we deal with those?
Maintenance mode has nothing to do with this. Maintenance mode stops fixups on buckets within the cluster.
Do you use useACK=true? If so, I'd try fiddling with ack_factor.
Our ingestion comes in through HEC; no universal or heavy forwarders in place.
My whole concern is to avoid congestion on the queues of the surviving indexers due to 'Streaming bucket failures' during network and other maintenance activities where an indexer restart is needed.
Splunk only provides maintenance mode to minimize the impact, but it is not helping here.
HEC can also be configured to use ACK from what I recall.
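If I remember correctly it's a per-token setting in inputs.conf on the HEC input, something like the sketch below (the token name is just a placeholder):

# inputs.conf (illustrative token stanza)
[http://my_hec_token]
token = <your-token-guid>
useACK = true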
Sure. useACK has its limitations and while useful it has to be used with caution.
I was just pointing out that while the OP is using HEC, ACK can still be enabled somewhere underneath (and it would give the effect described in the opening post, since events are queued until they are properly ack-ed).
It's the same even when we restart our indexers in a rolling manner with maintenance mode enabled.
We immediately notice streaming bucket failures and end up with blocked queues on the other surviving indexers, causing ingestion to drop.
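For completeness, this is roughly the sequence we follow for those rolling restarts (all on the cluster master; exact syntax may differ by version):

# On the cluster master
splunk enable maintenance-mode
splunk rolling-restart cluster-peers
# wait for all peers to rejoin and the cluster to stabilize
splunk disable maintenance-mode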