Deployment Architecture

Indexer cluster - "rolling-restart" fails frequently

sylim_splunk
Splunk Employee

When the Cluster Master initiates a restart, either via "splunk apply cluster-bundle" or "splunk rolling-restart cluster-peers", many of the indexers fail to restart: the peer is told to restart, it shuts down, and it never comes back.
Each time, we had to log on to the server and manually restart Splunk.

1 Solution

sylim_splunk
Splunk Employee

i) This appears to happen on indexers that are busy with running jobs. Splunk waits 6 minutes for the sub-tasks to complete and then kills them, after which splunkd often fails to come back up.
This has been fixed in version 7.2.4+, which will be available soon at https://www.splunk.com/en_us/download.html
As of this writing, the fixed versions are 7.0.9+, 7.1.6+, and 7.2.4+.
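
To confirm which release a peer is running before upgrading, you can check from the CLI (a quick sketch, assuming the default install path of /opt/splunk):

$ /opt/splunk/bin/splunk version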

As a workaround, you can also avoid the forced shutdown by increasing the splunkd_stop_timeout parameter in server.conf, which is available as of version 7.1.3+.
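
For example, a minimal sketch of raising the timeout to 10 minutes; the [general] stanza placement and the 600-second value are assumptions, so verify the attribute in server.conf.spec for your version and restart splunkd after the change:

[general]
# seconds splunkd waits for sub-tasks before a forced shutdown
# (the 6-minute wait described above suggests a 360-second default)
splunkd_stop_timeout = 600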

ii) If you are seeing this issue even though stopping Splunk takes less than 6 minutes, check whether you are using systemd to stop/start Splunk on a version prior to 7.2.2, which does not support systemd. If that is the case, the unit file should set RemainAfterExit=yes (rather than no):

[Service]
RemainAfterExit=yes

This prevents systemd from executing ExecStop, so splunkd can continue the restart procedure on its own. After editing the unit file, reload systemd to pick up the change:

$ systemctl daemon-reload

Systemd is supported as of version 7.2.2+. Once you are on 7.2.2 or later, make sure to delete the "RemainAfterExit=yes" line; otherwise it will cause the same issue again.

Pre-7.2.2: the unit file must contain RemainAfterExit=yes.
7.2.2 and later: the RemainAfterExit=yes line must be deleted.
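
To check which setting your unit file currently has, a grep like the following works; the unit file path is an assumption (the systemd-managed unit is commonly installed as /etc/systemd/system/Splunkd.service), so adjust it for your environment:

$ grep RemainAfterExit /etc/systemd/system/Splunkd.service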


