I have a cluster master with two cluster peers, configured with:
Replication Factor=3
Search Factor=2
For some reason, the replication factor and search factor are never met.
1. Why does the cluster master complain that the replication factor is not met?
2. On the cluster master, the peer status flaps between "Up" and "Pending".
It is most likely happening because of corrupted buckets; you can see them on the cluster master web page as well. To fix the issue, you need to remove them. Please see how to remove a bucket in this post.
Just stumbled upon this question. I think the answer is in the first line of your question.
2 peer cluster, replication of 3.
"♫ cause two out of three ain't bad ♫"
You only have 2 peers. How can you get 3 copies with only 2 peers when a peer can only have a single copy? 😉
TL;DR, your rep factor is larger than your number of peers. You need to either add another peer or reduce the replication factor.
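The rule above can be written down as a quick sanity check. This is only a minimal sketch: `check_factors` is a hypothetical helper, not part of Splunk, and it assumes each peer holds at most one copy of a bucket and that the search factor cannot exceed the replication factor.

```python
def check_factors(peer_count, replication_factor, search_factor):
    """Return a list of problems; an empty list means the settings are achievable.

    Hypothetical helper for illustration only, not a Splunk API.
    """
    problems = []
    # Each peer can hold only one copy of a bucket, so the cluster
    # needs at least as many peers as the replication factor.
    if replication_factor > peer_count:
        problems.append(
            f"replication factor {replication_factor} exceeds peer count {peer_count}"
        )
    # Searchable copies are a subset of all copies.
    if search_factor > replication_factor:
        problems.append(
            f"search factor {search_factor} exceeds replication factor {replication_factor}"
        )
    return problems


# The setup from the question: 2 peers, replication factor 3, search factor 2.
for problem in check_factors(peer_count=2, replication_factor=3, search_factor=2):
    print(problem)
```

With the values from the question, the only problem reported is the replication factor being larger than the number of peers.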
The splunkd.log from the cluster master essentially shows two issues:
(i) the peer transitioning from=Up to=Pending (with a reason) and the reverse, and
(ii) handleReplicationError between peers
Here is the extract from the log file:
01-29-2014 09:30:34.766 -0500 INFO CMPeer - peer=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 transitioning from=Up to=Pending reason="non-streaming failure"
01-29-2014 09:30:34.766 -0500 INFO CMMaster - event=handleReplicationError bid=_audit~10~17EBCE68-0006-4E11-B79A-261C6E98AF2A tgt=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 msg='target doesn't have bucket now. ignoring'
01-29-2014 09:30:34.768 -0500 INFO CMMaster - replication error src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A failing=tgt bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1
01-29-2014 09:30:34.768 -0500 INFO CMReplicationRegistry - Finished replication: bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 target=17EBCE68-0006-4E11-B79A-261C6E98AF2A
01-29-2014 09:30:34.768 -0500 INFO CMPeer - peer=17EBCE68-0006-4E11-B79A-261C6E98AF2A transitioning from=Up to=Pending reason="non-streaming failure"
01-29-2014 09:30:34.768 -0500 INFO CMMaster - event=handleReplicationError bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A msg='target doesn't have bucket now. ignoring'
01-29-2014 09:30:34.771 -0500 INFO CMMaster - replication error src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A failing=tgt bid=_audit~11~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1
04-29-2014 09:30:35.145 -0500 INFO CMPeer - peer=17EBCE68-0006-4E11-B79A-261C6E98AF2A transitioning from=Pending to=Up reason="heartbeat received."
The peer log files show errors like the following:
04-29-2014 09:17:19.717 -0500 WARN TcpOutputFd - Connect to 10.111.111.217:8080 failed. No route to host
04-29-2014 09:17:19.732 -0500 ERROR TcpOutputFd - Connection to host=10.111.111.217:8080 failed ......
04-30-2014 08:13:30.216 -0500 WARN TcpOutputFd - Connect to 10.111.111.216:8080 failed. No route to host
04-30-2014 08:13:30.216 -0500 ERROR TcpOutputFd - Connection to host=10.111.111.216:8080 faile
The issue was that the replication port was not open between the peers. To check the connectivity, I used:
From Peer1 : telnet
From Peer3 : telnet
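If telnet is not installed on a peer, the same reachability test can be approximated with a short script. This is a rough sketch, not an official Splunk tool: `can_connect` is a hypothetical helper, and the address and port in the commented usage line are simply the ones that appear in the peer log above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened, else False."""
    try:
        # create_connection raises OSError (e.g. "No route to host",
        # connection refused, or timeout) when the port is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, run from one peer against the other peer's address
# taken from the log extract:
# print(can_connect("10.111.111.217", 8080))
```

A result of False from one peer to the other is consistent with the "No route to host" errors in the log and points at a firewall or routing problem rather than at Splunk itself.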
After the required port was opened, restart the cluster environment in the following order:
Stop Cluster Master
Stop both Peers
Start the cluster Master
Start both Peers