Too many bucket replication errors to target peer

andynewsoncap
Engager

The platform has been live for close to two years.
Firewall ports are still open.
MTU is still 1500 and has not been changed.
There are no errors in the OS logs.
Three of my nine clusters started doing this. In fact, 7 out of 9 had "search factor not met" errors,
which were fixed by putting the CM into maintenance mode, doing a manual rolling restart of the indexers, restarting the CM, and taking it out of maintenance mode.
But I am left with three clusters that have thousands of bucket fix-ups, and now this error.


gjanders
SplunkTrust

You are probably at a point where you need to increase timeouts. What follows is from a pinned post by vraptor (sh-cluster channel) on the community Slack.

I'd expect the most important settings for your environment to be:

[clustering]
cxn_timeout = 300
rcv_timeout = 300
send_timeout = 300

But feel free to test; I've pasted the entire block of information below.

Indexer Side (All Indexers and Cluster Master) 
distsearch.conf 
[replicationSettings] 
sendRcvTimeout = 120 

server.conf 
[httpServer] 
busyKeepAliveIdleTimeout = 180 
streamInWriteTimeout = 30 

[sslConfig] 
useClientSSLCompression=false 


# Specifies how long before an intra-cluster connection will terminate. Default = 60.
# If a cluster indexer times out, it will re-add itself to the CM, which is itself a busy
# operation (it needs to resync the state of all its buckets) and can lead to a
# self-reinforcing feedback loop. These can be bumped up for busier clusters (300s).
[clustering]
cxn_timeout = 300
rcv_timeout = 300
send_timeout = 300
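
To check whether a given server.conf already carries the raised values, something like the following could be used. This is only a minimal sketch: the helper name `clustering_timeouts_below_target` is made up for illustration, the 300-second target and 60-second default are taken from the text above, and it reads a single file's contents rather than Splunk's merged configuration (running `splunk btool server list clustering` on the host would show the effective values).

```python
import configparser

# Values taken from the post above: target of 300s, default of 60s.
TARGET_SECONDS = 300
DEFAULT_SECONDS = 60
TIMEOUT_KEYS = ("cxn_timeout", "rcv_timeout", "send_timeout")


def clustering_timeouts_below_target(conf_text, target=TARGET_SECONDS):
    """Return {setting: value} for [clustering] timeouts lower than target.

    Hypothetical helper: parses one file's text only, not the layered
    configuration that Splunk actually resolves at runtime.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    low = {}
    for key in TIMEOUT_KEYS:
        # A missing key (or missing stanza) falls back to the quoted default.
        value = parser.getint("clustering", key, fallback=DEFAULT_SECONDS)
        if value < target:
            low[key] = value
    return low


sample = """
[clustering]
cxn_timeout = 300
rcv_timeout = 60
"""

print(clustering_timeouts_below_target(sample))
# {'rcv_timeout': 60, 'send_timeout': 60}
```

An empty result means all three timeouts in that file are at or above the target.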

Search Head Side (All Search Heads) 

distsearch.conf 
[distributedSearch] 
statusTimeout = 120 
connectionTimeout = 120 
authTokenConnectionTimeout = 120 
authTokenSendTimeout = 120 
authTokenReceiveTimeout = 120 
[replicationSettings] 
connectionTimeout = 120 
sendRcvTimeout = 120 

server.conf 
[sslConfig] 
useClientSSLCompression=false 

On the cluster master in server.conf:
# Specifies how long before an indexer is considered 'Down' when no heartbeat comes in
# Multiple of heartbeat_period, anywhere from 20x – 60x
[clustering]
heartbeat_timeout = 400

On the Indexers in server.conf:
# Specifies how often the Indexers contact the CM. Defaults to every 1 second
# For lots of peers ( >50) or lots of buckets (>100k), we can increase this value to 5-30
[clustering]
heartbeat_period = 5
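
The two heartbeat settings are meant to be chosen together, so it can help to work out the ratio before applying them. The sketch below is only arithmetic on the numbers quoted above; the function names are made up for illustration. With heartbeat_timeout = 400 and heartbeat_period = 5 the ratio comes out at 80x, which is above the 20x to 60x range the comment mentions, so it may be worth deciding which of the two values you actually want to keep.

```python
def heartbeat_ratio(heartbeat_timeout, heartbeat_period):
    """Return heartbeat_timeout expressed as a multiple of heartbeat_period."""
    if heartbeat_period <= 0:
        raise ValueError("heartbeat_period must be positive")
    return heartbeat_timeout / heartbeat_period


def suggested_timeout_range(heartbeat_period, low=20, high=60):
    """Return the (min, max) timeout implied by the 20x to 60x guidance."""
    return (low * heartbeat_period, high * heartbeat_period)


# Values quoted in the post: heartbeat_timeout = 400, heartbeat_period = 5.
print(heartbeat_ratio(400, 5))       # 80.0
print(suggested_timeout_range(5))    # (100, 300)
```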

On the indexers in indexes.conf (note that this is a per-index setting):
# Specifies how often to check through all the buckets, rolling
# them from hot to warm to cold as necessary. Default is 60 seconds
rotatePeriodInSecs=600

 


andynewsoncap
Engager

I'm not saying this fixed it, but it made the problem go away for now.

On the CM I set:

./splunk edit cluster-config -max_peer_build_load 8

./splunk edit cluster-config -max_peer_rep_load 20

I put the CM into maintenance mode.

And on the indexers I set

max_replication_errors to 10 (up from 3)

I then rebooted each indexer just to get them as clean as possible. Once I had rebooted each indexer and all were up and working, I restarted the CM and took it out of maintenance mode. After about 3 hours it sorted itself out. Then I reset everything back to the original values.
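
For anyone wanting to repeat this, the order of operations above can be written down as a dry-run checklist. This is a rough sketch only: it prints the steps and executes nothing. The two `edit cluster-config` commands are the ones quoted in this post; `splunk enable maintenance-mode` and `splunk disable maintenance-mode` are my assumption for how maintenance mode was toggled, and the post does not say where max_replication_errors was set, so that step is left as a plain description.

```python
# Ordered steps as described in the post. Nothing here is executed.
STEPS = [
    ("cluster manager", "./splunk edit cluster-config -max_peer_build_load 8"),
    ("cluster manager", "./splunk edit cluster-config -max_peer_rep_load 20"),
    # Assumption: maintenance mode toggled via the CLI on the manager.
    ("cluster manager", "./splunk enable maintenance-mode"),
    ("each indexer", "set max_replication_errors to 10 (was 3)"),
    ("each indexer", "reboot and confirm it is back up and working"),
    ("cluster manager", "restart Splunk"),
    ("cluster manager", "./splunk disable maintenance-mode"),
    ("cluster manager", "wait for fix-ups to settle (about 3 hours in the post)"),
    ("all", "reset the changed settings back to their original values"),
]


def format_runbook(steps):
    """Return the steps as numbered, human-readable lines."""
    return [
        f"{number}. [{where}] {what}"
        for number, (where, what) in enumerate(steps, start=1)
    ]


for line in format_runbook(STEPS):
    print(line)
```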
