
Replication Factor and Search Factor are never met on Cluster Master

Splunk Employee

I have a Cluster Master with two Cluster Peers, configured with:
Replication Factor=3
Search Factor=2

For some reason, the Replication Factor and Search Factor are never met.
1. Why does the Cluster Master complain that the Replication Factor is not met?
2. On the Cluster Master, the Peer status flaps between "Up" and "Pending".

Path Finder

This is most likely caused by corrupted buckets, which you can also see on the Cluster Master web page. To fix the issue, you need to remove them. This post explains how to remove a bucket:

https://answers.splunk.com/answers/184484/what-should-i-do-with-bad-buckets-in-a-clustered-e.html?so...

Motivator

Just stumbled upon this question. I think your answer is in the first line of your question.

2 peer cluster, replication of 3.

"♫ cause two out of three ain't bad ♫"

You only have 2 peers. How can you get 3 copies with only 2 peers when a peer can only have a single copy? 😉

TL;DR, your rep factor is larger than your number of peers. You need to either add another peer or reduce the replication factor.
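
To make the arithmetic explicit, here is a minimal sketch of the check. The function names are made up for illustration and are not part of Splunk; the code only encodes the rule stated above, that a peer can hold a single copy of a bucket.

```python
def max_copies_achievable(peer_count: int, replication_factor: int) -> int:
    """Return how many copies of a bucket the cluster can actually hold.

    Assumption (from the answer above): each peer holds at most one copy
    of a given bucket, so the copy count is capped by the number of peers.
    """
    if peer_count < 0 or replication_factor < 1:
        raise ValueError("peer_count must be >= 0 and replication_factor >= 1")
    return min(peer_count, replication_factor)


def replication_factor_can_be_met(peer_count: int, replication_factor: int) -> bool:
    """True only when there are enough peers for the requested number of copies."""
    return max_copies_achievable(peer_count, replication_factor) >= replication_factor


# The setup from the question: 2 peers, Replication Factor=3.
print(replication_factor_can_be_met(2, 3))  # False
# After adding a third peer, or reducing the factor to 2:
print(replication_factor_can_be_met(3, 3))  # True
print(replication_factor_can_be_met(2, 2))  # True
```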

Splunk Employee

The splunkd.log from the Cluster Master shows essentially two issues:
(i) Peers transitioning from=Up to=Pending and back again, and
(ii) handleReplicationError between peers.
Here is the extract from the log file:
01-29-2014 09:30:34.766 -0500 INFO CMPeer - peer=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 transitioning from=Up to=Pending reason="non-streaming failure"
01-29-2014 09:30:34.766 -0500 INFO CMMaster - event=handleReplicationError bid=_audit~10~17EBCE68-0006-4E11-B79A-261C6E98AF2A tgt=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 msg='target doesn't have bucket now. ignoring'
01-29-2014 09:30:34.768 -0500 INFO CMMaster - replication error src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A failing=tgt bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1
01-29-2014 09:30:34.768 -0500 INFO CMReplicationRegistry - Finished replication: bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 target=17EBCE68-0006-4E11-B79A-261C6E98AF2A
01-29-2014 09:30:34.768 -0500 INFO CMPeer - peer=17EBCE68-0006-4E11-B79A-261C6E98AF2A transitioning from=Up to=Pending reason="non-streaming failure"
01-29-2014 09:30:34.768 -0500 INFO CMMaster - event=handleReplicationError bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A msg='target doesn't have bucket now. ignoring'
01-29-2014 09:30:34.771 -0500 INFO CMMaster - replication error src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A failing=tgt bid=_audit~11~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1
04-29-2014 09:30:35.145 -0500 INFO CMPeer - peer=17EBCE68-0006-4E11-B79A-261C6E98AF2A transitioning from=Pending to=Up reason="heartbeat received."
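
If the log is long, counting the transitions per peer makes the flapping easier to see. The following is a rough sketch, not a Splunk tool: the regular expression is my own and is based only on the CMPeer line format in the extract above, so it may need adjusting for other versions.

```python
import re
from collections import Counter

# Pattern derived from the CMPeer lines in the extract above (an assumption,
# not an official log format specification).
TRANSITION_RE = re.compile(
    r'CMPeer - peer=(?P<peer>[0-9A-Fa-f-]+) transitioning '
    r'from=(?P<src>\w+) to=(?P<dst>\w+) reason="(?P<reason>[^"]*)"'
)


def count_transitions(lines):
    """Count (peer, from_state, to_state) transitions found in splunkd.log lines."""
    counts = Counter()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            counts[(match["peer"], match["src"], match["dst"])] += 1
    return counts


sample = [
    '01-29-2014 09:30:34.766 -0500 INFO CMPeer - peer=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 transitioning from=Up to=Pending reason="non-streaming failure"',
    '01-29-2014 09:30:34.768 -0500 INFO CMMaster - replication error src=93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1 tgt=17EBCE68-0006-4E11-B79A-261C6E98AF2A failing=tgt bid=_audit~10~93BF03E9-AF2F-415C-A81D-D7CFDE0FD0B1',
    '04-29-2014 09:30:35.145 -0500 INFO CMPeer - peer=17EBCE68-0006-4E11-B79A-261C6E98AF2A transitioning from=Pending to=Up reason="heartbeat received."',
]

for (peer, src, dst), n in count_transitions(sample).items():
    print(f"{peer}: {src} -> {dst} x{n}")
```

To run it against a real file, pass an open file handle for the Cluster Master's splunkd.log instead of the sample list.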

The peer log files show errors like the following:

04-29-2014 09:17:19.717 -0500 WARN TcpOutputFd - Connect to 10.111.111.217:8080 failed. No route to host
04-29-2014 09:17:19.732 -0500 ERROR TcpOutputFd - Connection to host=10.111.111.217:8080 failed ......
04-30-2014 08:13:30.216 -0500 WARN TcpOutputFd - Connect to 10.111.111.216:8080 failed. No route to host
04-30-2014 08:13:30.216 -0500 ERROR TcpOutputFd - Connection to host=10.111.111.216:8080 faile

The issue was that the replication port was not open between the Peers. To check connectivity, I used:
From Peer1 : telnet Peer2
From Peer3 : telnet Peer2
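
If telnet is not installed on the peers, the same check can be scripted. This is a minimal sketch using only the Python standard library; the helper name `can_connect` is hypothetical, and the address and port in the usage line are simply the ones that appear in the peer log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Errors such as "No route to host", "Connection refused", and timeouts
    all surface as OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Run this from each peer against the other peer's replication port.
    print(can_connect("10.111.111.217", 8080))
```

A False result only says the TCP connection could not be opened from that host; it does not distinguish a firewall block from the remote service being down.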

After the required replication port was opened, restart the cluster environment in this order:

Stop Cluster Master
Stop both Peers
Start the cluster Master
Start both Peers
