Splunk Enterprise

SHC rolling restart issue - systemd problem?

lukasmecir
Path Finder

Hi everyone, could someone help me with an SHC issue? I have an SHC with 6 members, and Splunk is running as a systemd service. This morning I ran a rolling restart of the SHC (via the GUI, searchable option off, force option off), and unfortunately the restart got stuck:
Output of splunk rolling-restart shcluster-members -status 1:


 Peer  |  Status  |  Start Time  |  End Time  |  GUID
 1. pbsmsas01.vs.csin.cz | NOT STARTED | N/A | N/A | 0335FD54-853B-4FB4-A77F-3AE80805D272
 2. pbsmsas02.vs.csin.cz | RESTARTING | Tue Sep  8 07:25:07 2020 | N/A | 52AF82EF-7703-4A45-8DAB-80787B630FE4
 3. ppsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 7869C19C-8575-42E6-B925-5C34AE036C3E
 4. pbsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 8C9148A1-AEC8-499F-BD40-D2A4DB49741C
 5. ppsmsas02.vs.csin.cz | NOT STARTED | N/A | N/A | CAB5B9F2-99F4-4CE1-9E8F-8A108A7AE907
Server pbsmsas02 has been in the restarting state for nearly 6 hours now.
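
For anyone who wants to catch this situation automatically, here is a minimal Python sketch that parses the status table above and flags members that have been in RESTARTING for too long. It assumes the pipe-separated layout shown in this post; the 30-minute threshold, the function names, and the fixed "now" value are illustrative assumptions, not anything Splunk defines.

```python
from datetime import datetime, timedelta

# Abbreviated copy of the status table from this post.
STATUS_OUTPUT = """\
 Peer  |  Status  |  Start Time  |  End Time  |  GUID
 1. pbsmsas01.vs.csin.cz | NOT STARTED | N/A | N/A | 0335FD54-853B-4FB4-A77F-3AE80805D272
 2. pbsmsas02.vs.csin.cz | RESTARTING | Tue Sep  8 07:25:07 2020 | N/A | 52AF82EF-7703-4A45-8DAB-80787B630FE4
 3. ppsmsas03.vs.csin.cz | NOT STARTED | N/A | N/A | 7869C19C-8575-42E6-B925-5C34AE036C3E
"""


def parse_rows(text):
    """Turn the pipe-separated status table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5 or fields[0] == "Peer":
            continue  # skip blank lines and the header row
        # Drop the leading ordinal ("2. ") from the peer column.
        peer = fields[0].split(". ", 1)[-1]
        start = None
        if fields[2] != "N/A":
            # Collapse the padded day ("Sep  8") before parsing.
            stamp = " ".join(fields[2].split())
            start = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        rows.append({"peer": peer, "status": fields[1], "start": start})
    return rows


def stuck_members(rows, now, limit=timedelta(minutes=30)):
    """Return peers that have been RESTARTING for longer than `limit` (assumed threshold)."""
    return [
        r["peer"]
        for r in rows
        if r["status"] == "RESTARTING" and r["start"] and now - r["start"] > limit
    ]


rows = parse_rows(STATUS_OUTPUT)
now = datetime(2020, 9, 8, 13, 20)  # hypothetical check time, nearly 6 hours later
print(stuck_members(rows, now))
```

In practice `now` would be `datetime.now()` on the same host, and the table text would come from running the status command; both are left out here to keep the sketch self-contained.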

In splunkd.log from pbsmsas02.vs.csin.cz I found this:

09-08-2020 07:25:12.962 +0200 WARN Restarter - Splunkd is configured to run as a systemd service, skipping external restart process
09-08-2020 07:25:12.962 +0200 INFO SHCSlave - event=SHPSlave::service detected restart is required, will restart node
09-08-2020 07:25:12.794 +0200 INFO SHCSlave - event=SHPSlave::handleHeartbeatDone master has instructed peer to restart
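
Note that the three lines above are not in chronological order (the 07:25:12.794 entry is listed last). A small Python sketch like the following can put them in time order so the sequence is easier to read. It assumes the timestamp layout seen in these lines (date, time with milliseconds, UTC offset); the function name is made up for illustration.

```python
from datetime import datetime

# The three splunkd.log lines quoted in this post, in the order they were pasted.
LOG_LINES = [
    "09-08-2020 07:25:12.962 +0200 WARN Restarter - Splunkd is configured to run as a systemd service, skipping external restart process",
    "09-08-2020 07:25:12.962 +0200 INFO SHCSlave - event=SHPSlave::service detected restart is required, will restart node",
    "09-08-2020 07:25:12.794 +0200 INFO SHCSlave - event=SHPSlave::handleHeartbeatDone master has instructed peer to restart",
]


def parse_line(line):
    """Split one log line into (timestamp, level, component, message)."""
    head, _, message = line.partition(" - ")
    date, time, offset, level, component = head.split()
    ts = datetime.strptime(f"{date} {time} {offset}", "%m-%d-%Y %H:%M:%S.%f %z")
    return ts, level, component, message


# sorted() is stable, so entries with identical timestamps keep their pasted order.
ordered = sorted((parse_line(l) for l in LOG_LINES), key=lambda entry: entry[0])

for ts, level, component, message in ordered:
    print(ts.strftime("%H:%M:%S.%f")[:-3], level, component, message)
```

Read in time order, the captain's restart instruction arrives first (handleHeartbeatDone), followed by the two entries at 07:25:12.962. Those two share a timestamp, so their relative order cannot be determined from the log alone.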

Output of splunk show shcluster-status looks good:

Captain:
                decommission_search_jobs_wait_secs : 180
                               dynamic_captain : 1
                               elected_captain : Mon Sep  7 08:04:38 2020
                                            id : 3ED24A60-790A-42D2-903B-0C30C6EFDD28
                              initialized_flag : 1
                                         label : ppsmsas01.vs.csin.cz
                 max_failures_to_keep_majority : 1
                                      mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
                         min_peers_joined_flag : 1
                               rolling_restart : restart
                          rolling_restart_flag : 1
                          rolling_upgrade_flag : 0
                            service_ready_flag : 1
                                stable_captain : 1

 Cluster Master(s):
        https://splunk-master.csin.cz:8089              splunk_version: 8.0.4.1

 Members:
        pbsmsas01.vs.csin.cz
                                         label : pbsmsas01.vs.csin.cz
                         last_conf_replication : Tue Sep  8 12:28:09 2020
                              manual_detention : off
                                      mgmt_uri : https://pbsmsas01.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.49:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 1
                                splunk_version : 8.0.4.1
                                        status : Up
        ppsmsas01.vs.csin.cz
                                         label : ppsmsas01.vs.csin.cz
                              manual_detention : off
                                      mgmt_uri : https://ppsmsas01.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.48:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 0
                                splunk_version : 8.0.4.1
                                        status : Up
        pbsmsas02.vs.csin.cz
                                         label : pbsmsas02.vs.csin.cz
                         last_conf_replication : Tue Sep  8 12:28:09 2020
                              manual_detention : off
                                      mgmt_uri : https://pbsmsas02.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.51:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 1
                                splunk_version : 8.0.4.1
                                        status : Restarting
        ppsmsas03.vs.csin.cz
                                         label : ppsmsas03.vs.csin.cz
                         last_conf_replication : Tue Sep  8 12:28:09 2020
                              manual_detention : off
                                      mgmt_uri : https://ppsmsas03.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.52:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 1
                                splunk_version : 8.0.4.1
                                        status : Up
        pbsmsas03.vs.csin.cz
                                         label : pbsmsas03.vs.csin.cz
                         last_conf_replication : Tue Sep  8 12:28:08 2020
                              manual_detention : off
                                      mgmt_uri : https://pbsmsas03.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.53:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 1
                                splunk_version : 8.0.4.1
                                        status : Up
        ppsmsas02.vs.csin.cz
                                         label : ppsmsas02.vs.csin.cz
                         last_conf_replication : Tue Sep  8 12:28:08 2020
                              manual_detention : off
                                      mgmt_uri : https://ppsmsas02.vs.csin.cz:8089
                                mgmt_uri_alias : https://10.177.155.50:8089
                              out_of_sync_node : 0
                             preferred_captain : 1
                              restart_required : 1
                                splunk_version : 8.0.4.1
                                        status : Up
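
To summarize a long listing like this one, a rough Python sketch can parse the Members section and count statuses. It assumes the "key : value" layout shown above, with each member name on its own line; the sample below is abbreviated to three members and a few keys, and the function name is illustrative.

```python
from collections import Counter

# Abbreviated copy of the Members section from this post.
MEMBERS_OUTPUT = """\
 Members:
        pbsmsas01.vs.csin.cz
                                         label : pbsmsas01.vs.csin.cz
                              restart_required : 1
                                        status : Up
        ppsmsas01.vs.csin.cz
                                         label : ppsmsas01.vs.csin.cz
                              restart_required : 0
                                        status : Up
        pbsmsas02.vs.csin.cz
                                         label : pbsmsas02.vs.csin.cz
                              restart_required : 1
                                        status : Restarting
"""


def parse_members(text):
    """Return {member_name: {key: value}} for the Members section."""
    members = {}
    current = None
    in_members = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Members:":
            in_members = True
            continue
        if not in_members:
            continue  # ignore the Captain block and anything else before Members
        if " : " in line:
            key, value = line.split(" : ", 1)
            if current is not None:
                members[current][key.strip()] = value.strip()
        else:
            # A line without " : " is treated as the start of a new member.
            current = line
            members[current] = {}
    return members


members = parse_members(MEMBERS_OUTPUT)
status_counts = Counter(m.get("status") for m in members.values())
# Members that are up but still waiting for their turn to restart.
pending = [
    name
    for name, m in members.items()
    if m.get("status") == "Up" and m.get("restart_required") == "1"
]
print(dict(status_counts))
print(pending)
```

Run against the full output, this kind of summary shows one member in Restarting and the remaining members Up, most of them still flagged with restart_required : 1.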

What is strange is that I have run this rolling restart many times before and never had a problem. Could someone please advise what to do now? Is it safe to manually restart the problematic server, or is there another solution? Thank you very much.


dvbeekcinq
New Member

@lukasmecir have you found the root cause of the issue with Splunk support?

We have been experiencing this issue for a while now and always end up restarting Splunk manually. It would be nice if we could fix it for future rolling restarts.


lukasmecir
Path Finder

@dvbeekcinq as far as I remember, the only recommendation from Splunk support was "restart it manually"... Fortunately, that helped. In fact, there have been many rolling restarts on this particular SHC and only this one failed; all the other restarts were fine.


thambisetty
SplunkTrust

@lukasmecir Raise a Splunk support case; it is very difficult to propose a solution here without a diag.


lukasmecir
Path Finder

OK, I will raise a support ticket, but it will take some time to find a solution... What is your opinion on manually restarting the problematic instance to get it unstuck? Is it safe?


isoutamo
SplunkTrust
Support usually answers quite soon once you have created a P1 ticket. A restart probably would not cause any issues, but since support may need some information, I would prefer to wait until they respond.
r. Ismo

lukasmecir
Path Finder

OK, I hope you are right about the quick answer 🙂


isoutamo
SplunkTrust
This should work as you expect. I suggest you create a Splunk support case to figure out whether this is a bug.
r. Ismo

thambisetty
SplunkTrust

What about the cluster status now?

Are the other members up and running?

How many are up and how many are down?


lukasmecir
Path Finder

As you can see in the splunk show shcluster-status --verbose output, all members except the restarting one are up. In other words, 5 members are up and 1 is restarting.

The cluster is running, and I can run searches etc. as usual.
