Deployment Architecture

Indexer cluster - "rolling-restart" fails frequently

sylim_splunk
Splunk Employee
Splunk Employee

When Cluster Master initiates a restart either by "splunk apply cluster-bundle" or "splunk rolling-restart cluster-peers" many of the indexers fails to restart - the server is told to restart, it shuts down and it never comes back.
Each time we had to log on to the server and manually restart splunk.

1 Solution

sylim_splunk
Splunk Employee
Splunk Employee

i) It appears to be happening to some indexers busy with the jobs running. Splunk will wait 6 minutes for the sub-tasks to complete, then kill them. Then it is likely failing to come back.
This has been fixed in the version 7.2.4+ and will be available soon in https://www.splunk.com/en_us/download.html
As of this writing, the Fixed versions : 7.0.9+, 7.1.6+ and 7.2.4+.

As a workaround you can also avoid the force-shutdown by increasing the parameter, splunkd_stop_timeout in server.conf which is available for the version 7.1.3+.

ii) If you are seeing this issue where it doesn't take longer than 6 mins to stop splunk, then you may want to check if you are using systemd for splunk stop/start and the version is prior to 7.2.2 which doesn't support systemd. If that's the case the unit file should use "RemainAfterExit=True" instead of False.

[Service]
RemainAfterExit=yes

By doing this it will prevent systemd from executing ExecStop so that splunkd can continue the restart procedure. After editing the file you should reload the systemd to pick up the change:

$ systemctl daemon-reload

Systemd is now supported for the versions, 7.2.2+ and then make sure to delete the line, "RemainAfterExit=true/yes". Otherwise it will cause the same issue again.

pre-7.2.2 must have RemainAfterExit = yes in the unit file.
post-7.2.2 must have the line, 'RemainAfterExit = yes' deleted.

View solution in original post

sylim_splunk
Splunk Employee
Splunk Employee

i) It appears to be happening to some indexers busy with the jobs running. Splunk will wait 6 minutes for the sub-tasks to complete, then kill them. Then it is likely failing to come back.
This has been fixed in the version 7.2.4+ and will be available soon in https://www.splunk.com/en_us/download.html
As of this writing, the Fixed versions : 7.0.9+, 7.1.6+ and 7.2.4+.

As a workaround you can also avoid the force-shutdown by increasing the parameter, splunkd_stop_timeout in server.conf which is available for the version 7.1.3+.

ii) If you are seeing this issue where it doesn't take longer than 6 mins to stop splunk, then you may want to check if you are using systemd for splunk stop/start and the version is prior to 7.2.2 which doesn't support systemd. If that's the case the unit file should use "RemainAfterExit=True" instead of False.

[Service]
RemainAfterExit=yes

By doing this it will prevent systemd from executing ExecStop so that splunkd can continue the restart procedure. After editing the file you should reload the systemd to pick up the change:

$ systemctl daemon-reload

Systemd is now supported for the versions, 7.2.2+ and then make sure to delete the line, "RemainAfterExit=true/yes". Otherwise it will cause the same issue again.

pre-7.2.2 must have RemainAfterExit = yes in the unit file.
post-7.2.2 must have the line, 'RemainAfterExit = yes' deleted.

Get Updates on the Splunk Community!

Welcome to the Splunk Community!

(view in My Videos) We're so glad you're here! The Splunk Community is place to connect, learn, give back, and ...

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Elevating Digital Service Excellence: The Synergy of Real User Monitoring and Application Performance ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...