Deployment Architecture

SHCluster Rolling restart stops frequently

Splunk Employee

After pushing a new shcluster bundle, the cluster self-initiated a rolling restart. We have 15 servers in this cluster. Five members had completed the restart when "splunk rolling-restart shcluster-members -status 1" showed the 10th server in "restarting" status; the very next run of the same command, however, reported that no rolling restart was in progress.
We also found that the captain had moved to another search head.
Running the status command from there also showed that no rolling restart was running.
Now, the problem is that 10 of the servers have the new configuration bundle, but the other 5 that were never restarted are still showing messages about having received new configs that require a restart, and they have not been restarted to pick them up.

1 Solution

Splunk Employee

As a temporary fix, manually restart the members that have not yet been restarted; that will let them pick up the new shcluster bundle successfully.
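A minimal sketch of that manual restart, assuming the standard `$SPLUNK_HOME/bin/splunk restart` CLI and SSH access; the hostnames are hypothetical and the real invocation is left commented out so this runs as a dry run:

```shell
# Hypothetical hostnames; replace with the members that did not restart.
hosts="searchhead11 searchhead12 searchhead13 searchhead14 searchhead15"

for host in $hosts; do
  echo "restart planned: $host"
  # Real invocation (uncomment once the host list is verified):
  # ssh "$host" "/opt/splunk/bin/splunk restart"
done > restart_plan.txt

cat restart_plan.txt
```

Restart them one at a time and confirm each comes back up before moving to the next, so search availability is preserved as in a normal rolling restart.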

In the versions available as of this writing, a captaincy change in the middle of a rolling restart can happen, but you need to find out why it is happening and get it fixed. Enhancement work is currently in progress.

To understand what happens between the deployer, the captain, and the members during the rolling restart, check the following:
i) Deployer pushes the bundle:
index=_internal host="SHC members" source=*/splunkd.log* "does not support reload" | dedup host | sort _time | table host _time

ii) Captain initiates rolling-restart of peers
index=_internal source=*/splunkd.log* "Starting a rolling restart of the peers" host=CaptainName

iii) SHC peers are instructed to restart
index=_internal host="SHC members" source=*/splunkd.log* "instructed peer to restart" | table _time host | sort _time

iv) Then check both the old captain and the new captain for any abnormal activity.

For example, on the search head searchhead11:
05-12-2019 23:18:25.740 -0400 INFO SHCSlave - event=SHPSlave::handleHeartbeatDone master has instructed peer to restart

05-12-2019 23:36:37.546 -0400 INFO ShutdownHandler - shutting down level "ShutdownLevel_DFM"

05-12-2019 23:41:03.191 -0400 INFO IndexProcessor - handleSignal : Disabling streaming searches.
05-12-2019 23:41:03.191 -0400 INFO SHClusterMgr - Starting to Signal shutdown RAFT
05-12-2019 23:41:03.191 -0400 INFO SHCRaftConsensus - Shutdown signal received.
05-12-2019 23:41:03.191 -0400 INFO SHClusterMgr - Signal shutdown RAFT completed
05-12-2019 23:46:32.401 -0400 ERROR AdminHandler:ServerControl - forcing shutdown since it did not complete in 600.000 seconds
05-12-2019 23:46:50.716 -0400 WARN ConfMetrics - single_action=BASE_INITIALIZE took wallclock_ms=1905
05-12-2019 23:46:50.729 -0400 INFO ServerConfig - My GUID is ABCDEFGH-1234-ABCD-A0A7-E8DA24D123456
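A quick way to spot members that hit the shutdown timeout, as in the ERROR line above, is to grep for the "forcing shutdown" message. The sample line below is copied from the excerpt so the command is runnable; on a real member, point it at the actual log instead:

```shell
# Sample line copied from the excerpt above, saved so the command is runnable.
cat > sample_splunkd.log <<'EOF'
05-12-2019 23:46:32.401 -0400 ERROR AdminHandler:ServerControl - forcing shutdown since it did not complete in 600.000 seconds
EOF

# On a real member, use $SPLUNK_HOME/var/log/splunk/splunkd.log instead.
grep -c "forcing shutdown since it did not complete" sample_splunkd.log
```

A non-zero count means that member took longer than the shutdown timeout (600 seconds here) and was force-killed, which can stall the rolling restart.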

v) On the new captain, check the messages from the "SHCRaftConsensus" component to find the raft voting activity; that should tell you why the captaincy change happened.
The then-captain may have had trouble talking to some members, as in the example below.

In splunkd.log
05-12-2019 22:36:30.573 -0500 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=https://searchhead11:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mo..., socket_error=Connect Timeout to https://searchhead11:8089
05-12-2019 22:41:47.666 -0500 ERROR SHCRaftConsensus - 70 consecutive appendEntriesFailures to https://searchhead11:8089
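To see which members the captain could not reach, you can extract the target URIs from those SHCRaftConsensus errors. The sample lines below are copied from the excerpt so the commands are runnable; on a real captain, run them against splunkd.log:

```shell
# Sample lines copied from the excerpt above.
cat > sample_captain.log <<'EOF'
05-12-2019 22:36:30.573 -0500 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=https://searchhead11:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mo..., socket_error=Connect Timeout to https://searchhead11:8089
05-12-2019 22:41:47.666 -0500 ERROR SHCRaftConsensus - 70 consecutive appendEntriesFailures to https://searchhead11:8089
EOF

# Count raft errors, then list the member URIs the captain failed to reach.
grep -c "ERROR SHCRaftConsensus" sample_captain.log
grep -o 'https://[^ ]*:8089' sample_captain.log | sort -u
```

A member that shows up repeatedly here is the one to investigate for network or load problems.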

Once you understand what happened during the bundle push, you should be able to work out what triggered the issue.

In general, captaincy changes happen either through the captaincy-transfer command or through timeouts in the raft layer. If you see this kind of issue frequently, check "election_timeout_ms" under the "shclustering" stanza and increase it so that it is higher than cxn_timeout, rcv_timeout, and send_timeout.
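A sketch of what that server.conf change might look like on each SHC member; the specific values are examples only, not recommendations, so tune them for your environment:

```ini
# server.conf on each SHC member -- example values only.
[shclustering]
# cxn_timeout, send_timeout, rcv_timeout are in seconds.
cxn_timeout = 60
send_timeout = 60
rcv_timeout = 60
# election_timeout_ms is in milliseconds; keep it above the three
# timeouts above (90000 ms > 60 s) so a slow-but-alive captain does
# not trigger a new raft election mid-restart.
election_timeout_ms = 90000
```

A rolling restart of the members is needed for the change to take effect everywhere.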
