The platform has been live for close to two years.
Firewall ports are still open.
MTU is still 1500 and has not been changed.
No errors in the OS logs.
Three of my nine clusters started doing this. In fact, 7 out of 9 have "search factor not met" errors.
These were fixed by putting the CM into maintenance mode, doing a manual rolling restart of the indexers, then restarting the CM and taking it out of maintenance mode.
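For reference, a sketch of that maintenance-mode / rolling-restart sequence using the standard Splunk CLI. The /opt/splunk path is an assumption on my part; adjust for your install.

```shell
# On the cluster master: enable maintenance mode so bucket fix-ups pause
/opt/splunk/bin/splunk enable maintenance-mode --answer-yes

# Still on the CM: kick off a rolling restart of all peer nodes
/opt/splunk/bin/splunk rolling-restart cluster-peers

# Once all peers are back up and reporting, restart the CM itself
/opt/splunk/bin/splunk restart

# Finally, take the cluster out of maintenance mode
/opt/splunk/bin/splunk disable maintenance-mode
```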
But I am left with three clusters with thousands of bucket fix-ups and this error now.
You are probably at a point where you need to increase timeouts. This is from a pinned post by vraptor in the #sh-cluster channel on community Slack.
I'd expect the most important ones for your environment will be:
[clustering]
cxn_timeout = 300
rcv_timeout = 300
send_timeout = 300
But feel free to test; I've pasted the entire block of information below.
Indexer Side (All Indexers and Cluster Master)

distsearch.conf:
[replicationSettings]
sendRcvTimeout = 120

server.conf:
[httpServer]
busyKeepAliveIdleTimeout = 180
streamInWriteTimeout = 30

[sslConfig]
useClientSSLCompression = false

# Specifies how long before an intra-cluster connection will terminate. Default = 60.
# If a cluster indexer times out, it will re-add itself to the CM, which itself is a busy
# operation (it needs to resync the state of all its buckets), which can lead to a negative
# feedback loop. These can be bumped up for busier clusters (300s).
[clustering]
cxn_timeout = 300
rcv_timeout = 300
send_timeout = 300

Search Head Side (All Search Heads)

distsearch.conf:
[distributedSearch]
statusTimeout = 120
connectionTimeout = 120
authTokenConnectionTimeout = 120
authTokenSendTimeout = 120
authTokenReceiveTimeout = 120

[replicationSettings]
connectionTimeout = 120
sendRcvTimeout = 120

server.conf:
[sslConfig]
useClientSSLCompression = false

On the cluster master, in server.conf:
# Specifies how long before an indexer is considered 'Down' when no heartbeat comes in.
# Multiple of heartbeat_period, anywhere from 20x to 60x.
[clustering]
heartbeat_timeout = 400

On the indexers, in server.conf:
# Specifies how often the indexers contact the CM. Defaults to every 1 second.
# For lots of peers (>50) or lots of buckets (>100k), we can increase this value to 5-30.
[clustering]
heartbeat_period = 5

On the indexers, in indexes.conf (note that this is a per-index setting):
# Specifies how often to check through all the buckets, rolling
# them from hot to warm to cold as necessary. Default is 60 seconds.
rotatePeriodInSecs = 600
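After pushing any of these, you can sanity-check what actually took effect with btool, which shows the merged result of all the conf layers (path is an assumption; adjust for your install):

```shell
# Show the merged [clustering] settings and which conf file each comes from
/opt/splunk/bin/splunk btool server list clustering --debug

# Same for the distributed search replication timeouts on a search head
/opt/splunk/bin/splunk btool distsearch list replicationSettings --debug
```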
I'm not saying this fixed it, but it made the error go away for now.
On the CM I set:
./splunk edit cluster-config -max_peer_build_load 8
./splunk edit cluster-config -max_peer_rep_load 20
I put the CM into maint mode.
And on the indexers I set max_replication_errors to 10 (up from the default of 3).
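In conf terms, that's this stanza in server.conf on each peer:

```
[clustering]
max_replication_errors = 10
```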
I then rebooted each indexer just to get them as clean as possible. Once I had rebooted each indexer and all were up and working, I restarted the CM and took it out of maintenance mode. After about 3 hours it sorted itself out, and then I reset everything back to the original values.
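While waiting for the fix-ups to drain, a rough way to watch progress from the CM. The admin credentials and localhost management port are placeholders, and I'm assuming the cluster/master/fixup REST endpoint here; check it against your Splunk version's REST API docs.

```shell
# Overall cluster health and per-peer status
/opt/splunk/bin/splunk show cluster-status

# Pending bucket fix-up tasks via the CM REST API (placeholder credentials)
curl -k -u admin:changeme \
  "https://localhost:8089/services/cluster/master/fixup?level=generation&output_mode=json"

# Once things settle, restore the load settings (2 and 5 are the
# shipped defaults, assuming you hadn't changed them before)
/opt/splunk/bin/splunk edit cluster-config -max_peer_build_load 2
/opt/splunk/bin/splunk edit cluster-config -max_peer_rep_load 5
```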