Indexer cluster - "rolling-restart" fails frequently

Splunk Employee

When the cluster master initiates a restart, either through "splunk apply cluster-bundle" or "splunk rolling-restart cluster-peers", many of the indexers fail to restart: the peer is told to restart, it shuts down, and it never comes back up.
Each time, we have had to log on to the server and restart Splunk manually.

1 Solution

Splunk Employee

i) This appears to happen on indexers that are busy running jobs. Splunk waits 6 minutes for the sub-tasks to complete and then kills them. After that forced shutdown, the indexer is likely to fail to come back up.
This has been fixed in 7.2.4 and later, which will be available soon at https://www.splunk.com/en_us/download.html
As of this writing, the fixed versions are 7.0.9+, 7.1.6+, and 7.2.4+.
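
To check a peer against the fixed versions listed above, a small shell helper can compare version strings. This is only a sketch: the function names version_ge and has_restart_fix are hypothetical, it assumes a sort that supports -V (GNU coreutils), and it reports release lines outside 7.0, 7.1, and 7.2 as unknown because the answer does not cover them.

```shell
#!/usr/bin/env bash

# version_ge A B: succeed when dotted version A is >= B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# has_restart_fix VERSION: succeed when VERSION is at or past the fixed
# maintenance release of its line (7.0.9, 7.1.6, or 7.2.4).
# Returns 2 for release lines the answer does not mention.
has_restart_fix() {
    case "$1" in
        7.0.*) version_ge "$1" 7.0.9 ;;
        7.1.*) version_ge "$1" 7.1.6 ;;
        7.2.*) version_ge "$1" 7.2.4 ;;
        *)     return 2 ;;
    esac
}

# Example use (pass in the version your peer reports):
#   has_restart_fix 7.1.5 || echo "upgrade or apply a workaround"
```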

As a workaround, you can avoid the forced shutdown by increasing the splunkd_stop_timeout parameter in server.conf, which is available in 7.1.3 and later.
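
A possible server.conf fragment for this workaround is sketched below. The [general] stanza placement and the value 600 are assumptions for illustration, not taken from the answer; confirm the stanza and units against server.conf.spec for your version before using it.

```ini
# server.conf
# ASSUMPTION: stanza placement; verify in server.conf.spec for your version.
[general]
# Time splunkd waits for a graceful stop before forcing one.
# ASSUMPTION: value is in seconds; 600 is only an example that is
# longer than the 6-minute wait described above.
splunkd_stop_timeout = 600
```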

ii) If you see this issue even though Splunk takes less than 6 minutes to stop, check whether you are using systemd to stop and start Splunk on a version prior to 7.2.2, which does not support systemd. If so, the unit file should set "RemainAfterExit=yes" ("true" is equivalent) instead of "no"/"false":

[Service]
RemainAfterExit=yes

This prevents systemd from executing ExecStop, so that splunkd can continue the restart procedure. After editing the file, reload systemd to pick up the change:

$ systemctl daemon-reload

Systemd is supported in 7.2.2 and later. Once you are on one of those versions, make sure to delete the "RemainAfterExit=true/yes" line; otherwise it will cause the same issue again.

Before 7.2.2: the unit file must contain RemainAfterExit=yes.
7.2.2 and later: the RemainAfterExit=yes line must be deleted from the unit file.
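
To see which state a given unit file is in, something like the following could be used. It is a minimal sketch: the function name unit_sets_remain_after_exit is hypothetical, and the unit file path in the usage comment is a placeholder, since the answer does not name the file.

```shell
#!/usr/bin/env bash

# unit_sets_remain_after_exit FILE: succeed when FILE has an active
# (uncommented) RemainAfterExit line set to a true value.
# systemd accepts 1, yes, true, and on as true; the case-insensitive
# match here is deliberately lenient.
unit_sets_remain_after_exit() {
    grep -Eiq '^[[:space:]]*RemainAfterExit[[:space:]]*=[[:space:]]*(1|yes|true|on)[[:space:]]*$' "$1"
}

# Example use (placeholder path; substitute your own unit file):
#   if unit_sets_remain_after_exit /path/to/your/splunk.service; then
#       echo "RemainAfterExit is set: expected before 7.2.2, remove on 7.2.2+"
#   else
#       echo "RemainAfterExit is not set: expected on 7.2.2+, add it before 7.2.2"
#   fi
```

After changing the file either way, run systemctl daemon-reload as described above.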
