We are attempting to add three nodes to an existing SHC. We can add them, but they do not survive a restart without going back to a "DOWN" state. The workaround is to stop Splunk, clean raft, and add the member back; however, this only changes its status to "UP" until the next restart.
/opt/splunk/bin/splunk clean raft
/opt/splunk/bin/splunk start
/opt/splunk/bin/splunk add shcluster-member -current_member_uri https://<url>:8089
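(Member status ("Up"/"Down") is what the standard SHC status command reports from any node:)
/opt/splunk/bin/splunk show shcluster-status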
Any ideas? Should we perform a bootstrap as shown here?
https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Handleraftissues
Solution -
Our issues were related to RAFT protocol problems, a mgmt_uri mismatch, and appendEntries errors on the existing members. At the suggestion of Splunk Support, we set the SHC to a static captain, which bypasses RAFT, and afterwards everything was fine.
We edited /opt/splunk/etc/system/local/server.conf to include these entries under the [shclustering] stanza to increase the RAFT and connectivity timeouts.
captain_is_adhoc_searchhead = true
cxn_timeout_raft = 6
rcv_timeout_raft = 10
send_timeout_raft = 10
cxn_timeout = 120
send_timeout = 120
rcv_timeout = 120
election_timeout_ms = 120000
heartbeat_timeout = 120
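These settings take effect after a restart. The merged [shclustering] configuration (and which file each value comes from) can be checked with btool:
/opt/splunk/bin/splunk btool server list shclustering --debug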
We then set a static captain as described here:
https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Staticcaptain
On the captain:
/opt/splunk/bin/splunk edit shcluster-config -mode captain -captain_uri https://captainuri:8089 -election false
On the members:
/opt/splunk/bin/splunk edit shcluster-config -mode member -captain_uri https://captainuri:8089 -election false
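To confirm the conversion, the captain's member list can be viewed from the captain node with the standard command:
/opt/splunk/bin/splunk list shcluster-members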
At which point all nodes showed as cluster members successfully.
We then reverted to a dynamic captain, performed a bootstrap, and ran several rolling restarts to confirm the members behaved as expected.
Run the following on each member, captain last:
/opt/splunk/bin/splunk edit shcluster-config -election true -mgmt_uri https://memberurl:8089
Then bootstrap to "rebuild" the member entries in the KV store:
/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>"
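For illustration only, with three hypothetical members (hostnames here are placeholders, not the actual environment) the bootstrap would look like:
/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "https://sh1.example.com:8089,https://sh2.example.com:8089,https://sh3.example.com:8089"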
Is the deployer online and available to these nodes? There does need to be connectivity for the initial join.
Regarding re-adding these, you'll probably want to do a splunk resync shcluster-replicated-config to force the members to reach out to the captain and replicate configurations. If those members have been out of the cluster for quite some time, more than 24 hours, you should probably do a full clean on them. See the documentation here: https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Addaclustermember#Add_a_member_that_le...
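For reference, the resync is a single command run on each affected member:
/opt/splunk/bin/splunk resync shcluster-replicated-config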