Hi everyone, could someone help me with an SHC issue? The problem: I have an SHC with 6 members, and Splunk is running as a systemd service. This morning I started a rolling restart of the SHC (via the GUI, searchable option off, force option off) and unfortunately the restart got stuck.
Output of splunk rolling-restart shcluster-members -status 1:
Peer | Status | Start Time | End Time | GUID
1. pbsmsas01.vs.csin.cz | NOT STARTED | N/A | N/A | 0335FD54-853B-4FB4-A77F-3AE80805D272
2. pbsmsas02.vs.csin.cz | RESTARTING | Tue Sep 8 07:25:07 2020 | N/A | 52AF82EF-7703-4A45-8DAB-80787B630FE4
3. ppsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 7869C19C-8575-42E6-B925-5C34AE036C3E
4. pbsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 8C9148A1-AEC8-499F-BD40-D2A4DB49741C
5. ppsmsas02.vs.csin.cz | NOT STARTED | N/A | N/A | CAB5B9F2-99F4-4CE1-9E8F-8A108A7AE907
Server pbsmsas02 has been in the restarting state for nearly 6 hours now.
In splunkd.log from pbsmsas02.vs.csin.cz I found this:
09-08-2020 07:25:12.962 +0200 WARN Restarter - Splunkd is configured to run as a systemd service, skipping external restart process
09-08-2020 07:25:12.962 +0200 INFO SHCSlave - event=SHPSlave::service detected restart is required, will restart node
09-08-2020 07:25:12.794 +0200 INFO SHCSlave - event=SHPSlave::handleHeartbeatDone master has instructed peer to restart
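For reference, a quick way to confirm how the systemd-managed install is set up on the stuck member (a minimal sketch; Splunkd is the default unit name created by splunk enable boot-start -systemd-managed 1, so adjust if yours differs):

# Check whether splunkd is running under systemd and how the unit is defined (unit name may differ)
systemctl status Splunkd
# Show the unit file, including the Restart= setting which, as I understand it, splunkd relies on
# when it "skips the external restart process" during a rolling restart
systemctl cat Splunkd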
Output of splunk show shcluster-status looks good:
Captain:
decommission_search_jobs_wait_secs : 180
dynamic_captain : 1
elected_captain : Mon Sep 7 08:04:38 2020
id : 3ED24A60-790A-42D2-903B-0C30C6EFDD28
initialized_flag : 1
label : ppsmsas01.vs.csin.cz
max_failures_to_keep_majority : 1
mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
min_peers_joined_flag : 1
rolling_restart : restart
rolling_restart_flag : 1
rolling_upgrade_flag : 0
service_ready_flag : 1
stable_captain : 1
Cluster Master(s):
https://splunk-master.csin.cz:8089 splunk_version: 8.0.4.1
Members:
pbsmsas01.vs.csin.cz
label : pbsmsas01.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://pbsmsas01.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.49:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
ppsmsas01.vs.csin.cz
label : ppsmsas01.vs.csin.cz
manual_detention : off
mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.48:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 0
splunk_version : 8.0.4.1
status : Up
pbsmsas02.vs.csin.cz
label : pbsmsas02.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://pbsmsas02.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.51:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Restarting
ppsmsas03.vs.csin.cz
label : ppsmsas03.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:09 2020
manual_detention : off
mgmt_uri : https://ppsmsas03.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.52:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
pbsmsas03.vs.csin.cz
label : pbsmsas03.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:08 2020
manual_detention : off
mgmt_uri : https://pbsmsas03.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.53:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
ppsmsas02.vs.csin.cz
label : ppsmsas02.vs.csin.cz
last_conf_replication : Tue Sep 8 12:28:08 2020
manual_detention : off
mgmt_uri : https://ppsmsas02.vs.csin.cz:8089
mgmt_uri_alias : https://10.177.155.50:8089
out_of_sync_node : 0
preferred_captain : 1
restart_required : 1
splunk_version : 8.0.4.1
status : Up
What is strange is that I have done this rolling restart many times before and never had a problem. Could someone please advise what to do now? Is it safe to manually restart the problematic server? Or is there another solution? Thank you very much.
@lukasmecir Have you found the root cause of the issue with Splunk support?
We have been experiencing this issue for a while now, always restarting Splunk manually, but it would be nice if we could fix it for future rolling restarts.
@dvbeekcinq As far as I remember, the only recommendation from Splunk support was "restart it manually"... Fortunately, it helped. In fact, there have been many rolling restarts on this particular SHC and only this one failed; all the others went fine.
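For anyone hitting the same thing, the manual restart on the stuck member was roughly this (a sketch, assuming a systemd-managed install with the default Splunkd unit name; paths and unit names may differ in your environment):

# On the stuck member (pbsmsas02), restart splunkd through systemd rather than "splunk restart"
sudo systemctl restart Splunkd
# Then verify from any member that the node rejoined and the rolling restart completed
$SPLUNK_HOME/bin/splunk show shcluster-status --verbose
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members -status 1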
@lukasmecir Raise a case with Splunk support; it is very difficult to propose a solution here without a diag.
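If you open a case, support will almost certainly ask for a diag from the stuck member; it can be generated with the standard diag command (output file name and size will vary):

# Generate a diag bundle on the affected member and attach it to the support case
$SPLUNK_HOME/bin/splunk diag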
OK, I will raise a support ticket, but it will take some time to find a solution... What is your opinion on manually restarting the problematic instance to get out of this stuck state? Is it safe?
OK, I hope you are right about a quick answer 🙂
What about the cluster status now?
Are the other members up and running?
How many are up and how many are down?
As you can see from the splunk show shcluster-status --verbose output above:
All members (except the restarting one) are up; in other words, 5 members are up and 1 is restarting.
The cluster is running, and I can run searches etc. as usual.