We are attempting to add three nodes to an existing SHC. We can add them, but they do not survive a restart without going back to a "DOWN" state. The workaround is to stop Splunk, clean raft, and add the member back; however, this only changes its status to "UP" until the next restart.
/opt/splunk/bin/splunk clean raft
/opt/splunk/bin/splunk start
/opt/splunk/bin/splunk add shcluster-member -current_member_uri https://<url>:8089
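(Member status ("Up"/"Down") is what the standard SHC status command reports from any node:)
/opt/splunk/bin/splunk show shcluster-status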
Any ideas? Should we perform a bootstrap as shown here?
https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Handleraftissues
Solution -
Our issues were related to RAFT protocol problems, a mgmt_uri mismatch, and appendEntries errors on the existing members. At the suggestion of Splunk Support, we set the SHC to a static captain, which bypasses RAFT, and afterwards everything was fine.
We edited /opt/splunk/etc/system/local/server.conf to include these entries under the [shclustering] stanza to increase the RAFT and connectivity timeouts.
captain_is_adhoc_searchhead = true
cxn_timeout_raft = 6
rcv_timeout_raft = 10
send_timeout_raft = 10
cxn_timeout = 120
send_timeout = 120
rcv_timeout = 120
election_timeout_ms = 120000
heartbeat_timeout = 120
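These settings take effect after a restart. The merged [shclustering] configuration (and which file each value comes from) can be checked with btool:
/opt/splunk/bin/splunk btool server list shclustering --debug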
We then set a static captain as described here:
https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Staticcaptain
On the captain:
/opt/splunk/bin/splunk edit shcluster-config -mode captain -captain_uri https://captainuri:8089 -election false
On the members:
/opt/splunk/bin/splunk edit shcluster-config -mode member -captain_uri https://captainuri:8089 -election false
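To confirm the conversion, the captain's member list can be viewed from the captain node with the standard command:
/opt/splunk/bin/splunk list shcluster-members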
At which point all nodes showed as cluster members successfully.
We then reverted to a dynamic captain, performed a bootstrap, and ran several rolling restarts to confirm the members behaved as expected.
Run the following on each member, captain last:
/opt/splunk/bin/splunk edit shcluster-config -election true -mgmt_uri https://memberurl:8089
Then bootstrap to "rebuild" the member entries in the KV store:
/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>"
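For illustration only, with three hypothetical members (hostnames here are placeholders, not the actual environment) the bootstrap would look like:
/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "https://sh1.example.com:8089,https://sh2.example.com:8089,https://sh3.example.com:8089"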
Is the deployer online and available to these nodes? There does need to be connectivity for the initial join.
Regarding re-adding these, you'll probably want to do a splunk resync shcluster-replicated-config to force the members to reach out to the captain and replicate configurations. If those members have been out of the cluster for quite some time, more than 24 hours, you should probably do a full clean on them. See the documentation here: https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Addaclustermember#Add_a_member_that_le...
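For reference, the resync is a single command run on each affected member:
/opt/splunk/bin/splunk resync shcluster-replicated-config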