SHC New Member reverts to down after restart

jondukehds
Explorer

We are attempting to add three nodes to an existing SHC. We can add them, but they do not survive a restart: each one reverts to a "DOWN" state. The workaround is to stop Splunk, clean raft, and add the member back. However, this only changes its status to "UP" until the next restart.

/opt/splunk/bin/splunk stop
/opt/splunk/bin/splunk clean raft
/opt/splunk/bin/splunk start
/opt/splunk/bin/splunk add shcluster-member -current_member_uri https://<url>:8089

Any ideas? Should we perform a bootstrap as shown here?

https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Handleraftissues

 

 

1 Solution

jondukehds
Explorer

Solution -

Our problems came down to RAFT protocol issues, namely mgmt_uri mismatch and appendentries errors, on the existing members. At the suggestion of Splunk support, we set the SHC to a static captain, which bypasses RAFT, and afterwards everything was fine.

We edited /opt/splunk/etc/system/local/server.conf to include these entries under [shclustering], to increase the RAFT and connectivity timeouts:

captain_is_adhoc_searchhead = true
cxn_timeout_raft = 6
rcv_timeout_raft = 10
send_timeout_raft = 10
cxn_timeout = 120
send_timeout = 120
rcv_timeout = 120
election_timeout_ms = 120000
heartbeat_timeout = 120

 

Then we set a static captain as described here:

https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Staticcaptain

On the captain:

/opt/splunk/bin/splunk edit shcluster-config -mode captain -captain_uri https://captainuri:8089 -election false

On the members:

/opt/splunk/bin/splunk edit shcluster-config -mode member -captain_uri https://captainuri:8089 -election false

At that point, all nodes showed up as cluster members.

We then reverted to a dynamic captain, performed a bootstrap, and ran several rolling restarts to confirm that the members behaved as expected.

Run this on each member, captain last:

/opt/splunk/bin/splunk edit shcluster-config -election true -mgmt_uri https://memberurl:8089
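The "each member, captain last" ordering can be sketched as a small dry-run helper. This is only an illustration, not something from Splunk's tooling: the function name election_commands and the host names are hypothetical, and port 8089 is assumed from the commands above. It prints the command to run on each host in order and executes nothing.

```shell
# Hypothetical helper: print the "re-enable election" command for each
# host, listing the non-captain members first and the captain last.
# Usage: election_commands <captain_host> <member_host>...
election_commands() (
  captain="$1"; shift
  # Remaining arguments are the members; the captain is appended at the end.
  for host in "$@" "$captain"; do
    printf '%s\n' "/opt/splunk/bin/splunk edit shcluster-config -election true -mgmt_uri https://${host}:8089"
  done
)

# Example with made-up host names; each printed line is meant to be run
# on the host named in its -mgmt_uri.
election_commands sh-captain.example.com sh2.example.com sh3.example.com
```

Printing the commands rather than running them keeps the sketch safe to try; how you actually run each one on its own host (ssh, configuration management, by hand) is up to your environment.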

Then bootstrap to "rebuild" the member entries in the KV store:

/opt/splunk/bin/splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>"
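Since -servers_list takes a comma-separated list of member URIs, a small helper can assemble it from host names. This is a rough sketch under assumptions: build_servers_list and the host names are made up for illustration, and it assumes every member uses https and the same management port.

```shell
# Hypothetical helper: join hosts into the comma-separated URI list
# expected by -servers_list.
# Usage: build_servers_list <management_port> <host>...
build_servers_list() (
  port="$1"; shift
  list=""
  for host in "$@"; do
    # Prepend a comma only when the list already has an entry.
    list="${list:+$list,}https://${host}:${port}"
  done
  printf '%s\n' "$list"
)

# Example with made-up host names:
build_servers_list 8089 sh1.example.com sh2.example.com sh3.example.com
# prints https://sh1.example.com:8089,https://sh2.example.com:8089,https://sh3.example.com:8089
```

The result could then be passed in quotes, for example -servers_list "$(build_servers_list 8089 sh1.example.com sh2.example.com sh3.example.com)", on the member you intend to bootstrap as captain.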

 

 

 


esix_splunk
Splunk Employee

Is the deployer online and available to these nodes? There does need to be connectivity for the initial join.

 

Regarding re-adding these, you'll probably want to run splunk resync shcluster-replicated-config to force the members to reach out to the captain and replicate configurations. If those members have been out of the cluster for quite some time (more than 24 hours), you should probably do a full clean on them. See the documentation on this here: https://docs.splunk.com/Documentation/Splunk/8.0.4/DistSearch/Addaclustermember#Add_a_member_that_le...
