Indexer cluster - "rolling-restart" fails frequently

sylim_splunk
Splunk Employee

When the Cluster Master initiates a restart, either with "splunk apply cluster-bundle" or "splunk rolling-restart cluster-peers", many of the indexers fail to restart: the server is told to restart, it shuts down, and it never comes back.
Each time, we have had to log on to the server and restart Splunk manually.

1 Solution

sylim_splunk
Splunk Employee

i) This appears to happen on indexers that are busy running jobs. Splunk waits 6 minutes for the sub-tasks to complete, then kills them. After that, it is likely to fail to come back.
This has been fixed in version 7.2.4+, which will be available soon at https://www.splunk.com/en_us/download.html
As of this writing, the fixed versions are 7.0.9+, 7.1.6+ and 7.2.4+.

As a workaround, you can also avoid the forced shutdown by increasing the splunkd_stop_timeout parameter in server.conf, which is available in version 7.1.3+.
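A minimal sketch of what that workaround might look like on an indexer. The [general] stanza, the file location and the value 600 are assumptions for illustration, not something stated in this post; check server.conf.spec for your version before using it. The 6-minute wait mentioned above corresponds to 360 seconds.

```ini
# $SPLUNK_HOME/etc/system/local/server.conf (assumed location)
[general]
# Assumed stanza; the value is in seconds. 600 is an arbitrary example
# that is larger than the 360-second (6-minute) wait described above.
splunkd_stop_timeout = 600
```

Pick a value long enough for the busiest indexer to finish its running jobs. A change to server.conf normally only takes effect after splunkd is restarted.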

ii) If you are seeing this issue even though Splunk takes less than 6 minutes to stop, check whether you are using systemd to stop/start Splunk on a version prior to 7.2.2, which doesn't support systemd. If that's the case, the unit file should set RemainAfterExit to yes (true) instead of no (false):

[Service]
RemainAfterExit=yes

This prevents systemd from executing ExecStop, so that splunkd can continue the restart procedure. After editing the file, reload systemd to pick up the change:

$ systemctl daemon-reload

Systemd is supported from version 7.2.2 onward. Once you are on 7.2.2+, make sure to delete the "RemainAfterExit=true/yes" line; otherwise it will cause the same issue again.

Before 7.2.2: the unit file must have RemainAfterExit=yes.
7.2.2 and later: the 'RemainAfterExit=yes' line must be deleted.
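The version rule above can be sketched as a small shell helper. This is only an illustration: the function name remain_after_exit_action is hypothetical, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helper (not part of Splunk): prints "keep" when the unit file
# should contain RemainAfterExit=yes (Splunk older than 7.2.2) and "delete"
# when the line should be removed (7.2.2 and later).
remain_after_exit_action() {
    version="$1"
    threshold="7.2.2"
    # sort -V compares version strings component by component;
    # the first line of its output is the lower of the two versions.
    lowest=$(printf '%s\n%s\n' "$version" "$threshold" | sort -V | head -n 1)
    if [ "$version" != "$threshold" ] && [ "$lowest" = "$version" ]; then
        echo "keep"
    else
        echo "delete"
    fi
}

remain_after_exit_action "7.1.6"   # prints: keep
remain_after_exit_action "7.2.4"   # prints: delete
```

The version string can be read from the output of "splunk version" on each indexer and passed to the helper by hand.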
