Hi everyone, could someone help me with an SHC issue? The problem: I have an SHC with 6 members, and Splunk is running as a systemd service. This morning I started a rolling restart of the SHC (via the GUI, searchable option off, force option off) and unfortunately the restart got stuck.
Output of splunk rolling-restart shcluster-members -status 1:
Peer | Status | Start Time | End Time | GUID
1. pbsmsas01.vs.csin.cz | NOT STARTED | N/A | N/A | 0335FD54-853B-4FB4-A77F-3AE80805D272
2. pbsmsas02.vs.csin.cz | RESTARTING | Tue Sep 8 07:25:07 2020 | N/A | 52AF82EF-7703-4A45-8DAB-80787B630FE4
3. ppsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 7869C19C-8575-42E6-B925-5C34AE036C3E
4. pbsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 8C9148A1-AEC8-499F-BD40-D2A4DB49741C
5. ppsmsas02.vs.csin.cz | NOT STARTED | N/A | N/A | CAB5B9F2-99F4-4CE1-9E8F-8A108A7AE907
Server pbsmsas02 has been in the restarting state for nearly 6 hours now.
In splunkd.log from pbsmsas02.vs.csin.cz I found this:
09-08-2020 07:25:12.962 +0200 WARN Restarter - Splunkd is configured to run as a systemd service, skipping external restart process
09-08-2020 07:25:12.962 +0200 INFO SHCSlave - event=SHPSlave::service detected restart is required, will restart node
09-08-2020 07:25:12.794 +0200 INFO SHCSlave - event=SHPSlave::handleHeartbeatDone master has instructed peer to restart
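For reference, a quick way to confirm how the systemd-managed install is set up on the stuck member (a minimal sketch; Splunkd is the default unit name created by splunk enable boot-start -systemd-managed 1, so adjust if yours differs):

# Check whether splunkd is running under systemd and how the unit is defined (unit name may differ)
systemctl status Splunkd
# Show the unit file, including the Restart= setting which, as I understand it, splunkd relies on
# when it "skips the external restart process" during a rolling restart
systemctl cat Splunkd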
Output of splunk show shcluster-status looks good:
Captain:
decommission_search_jobs_wait_secs : 180
dynamic_captain : 1
elected_captain : Mon Sep 7 08:04:38 2020
id : 3ED24A60-790A-42D2-903B-0C30C6EFDD28
initialized_flag : 1
label : ppsmsas01.vs.csin.cz
max_failures_to_keep_majority : 1
mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
min_peers_joined_flag : 1
rolling_restart : restart
rolling_restart_flag : 1
rolling_upgrade_flag : 0
service_ready_flag : 1
stable_captain : 1
Cluster Master(s):
https://splunk-master.csin.cz:8089 splunk_version: 8.0.4.1
Members:
pbsmsas01.vs.csin.cz
label : pbsmsas01.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://pbsmsas01.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.49:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
ppsmsas01.vs.csin.cz
label : ppsmsas01.vs.csin.cz
manual_detention : off
mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.48:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 0
splunk_version : 8.0.4.1
status : Up
pbsmsas02.vs.csin.cz
label : pbsmsas02.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://pbsmsas02.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.51:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Restarting
ppsmsas03.vs.csin.cz
label : ppsmsas03.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://ppsmsas03.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.52:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
pbsmsas03.vs.csin.cz
label : pbsmsas03.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:08 2020
manual_detention : off
mgmt_uri : https://pbsmsas03.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.53:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
ppsmsas02.vs.csin.cz
label : ppsmsas02.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:08 2020
manual_detention : off
mgmt_uri : https://ppsmsas02.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.50:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
What is strange is that I have done this rolling restart many times before and never had a problem. Could someone please advise what to do now? Is it safe to manually restart the problematic server? Or is there another solution? Thank you very much.
@lukasmecir Have you found the root cause of the issue with Splunk support?
We have been experiencing this issue for a while now, always restarting Splunk manually, but it would be nice if we could fix it for future rolling restarts.
@dvbeekcinq As far as I remember, the only recommendation from Splunk support was "restart it manually"... Fortunately, it helped. In fact, there have been many rolling restarts on this particular SHC and only this one failed; all the others went fine.
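For anyone hitting the same thing, the manual restart on the stuck member was roughly this (a sketch, assuming a systemd-managed install with the default Splunkd unit name; paths and unit names may differ in your environment):

# On the stuck member (pbsmsas02), restart splunkd through systemd rather than "splunk restart"
sudo systemctl restart Splunkd
# Then verify from any member that the node rejoined and the rolling restart completed
$SPLUNK_HOME/bin/splunk show shcluster-status --verbose
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members -status 1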
@lukasmecir Raise a case with Splunk support; it is very difficult to propose a solution here without a diag.
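If you open a case, support will almost certainly ask for a diag from the stuck member; it can be generated with the standard diag command (output file name and size will vary):

# Generate a diag bundle on the affected member and attach it to the support case
$SPLUNK_HOME/bin/splunk diag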
OK, I will raise a support ticket, but it will take some time to find a solution... What is your opinion on manually restarting the problematic instance to get out of this stuck state? Is it safe?
OK, I hope you are right about a quick answer 🙂
What about the cluster status now?
Are the other members up and running?
How many are up and how many are down?
As you can see from the splunk show shcluster-status --verbose output above:
All members (except the restarting one) are up; in other words, 5 members are up and 1 is restarting.
The cluster is running, and I can run searches etc. as usual.