Deployment Architecture

Indexer cluster - "rolling-restart" fails frequently

sylim_splunk
Splunk Employee
Splunk Employee

When Cluster Master initiates a restart either by "splunk apply cluster-bundle" or "splunk rolling-restart cluster-peers" many of the indexers fails to restart - the server is told to restart, it shuts down and it never comes back.
Each time we had to log on to the server and manually restart splunk.

1 Solution

sylim_splunk
Splunk Employee
Splunk Employee

i) It appears to be happening to some indexers busy with the jobs running. Splunk will wait 6 minutes for the sub-tasks to complete, then kill them. Then it is likely failing to come back.
This has been fixed in the version 7.2.4+ and will be available soon in https://www.splunk.com/en_us/download.html
As of this writing, the Fixed versions : 7.0.9+, 7.1.6+ and 7.2.4+.

As a workaround you can also avoid the force-shutdown by increasing the parameter, splunkd_stop_timeout in server.conf which is available for the version 7.1.3+.

ii) If you are seeing this issue where it doesn't take longer than 6 mins to stop splunk, then you may want to check if you are using systemd for splunk stop/start and the version is prior to 7.2.2 which doesn't support systemd. If that's the case the unit file should use "RemainAfterExit=True" instead of False.

[Service]
RemainAfterExit=yes

By doing this it will prevent systemd from executing ExecStop so that splunkd can continue the restart procedure. After editing the file you should reload the systemd to pick up the change:

$ systemctl daemon-reload

Systemd is now supported for the versions, 7.2.2+ and then make sure to delete the line, "RemainAfterExit=true/yes". Otherwise it will cause the same issue again.

pre-7.2.2 must have RemainAfterExit = yes in the unit file.
post-7.2.2 must have the line, 'RemainAfterExit = yes' deleted.

View solution in original post

sylim_splunk
Splunk Employee
Splunk Employee

i) It appears to be happening to some indexers busy with the jobs running. Splunk will wait 6 minutes for the sub-tasks to complete, then kill them. Then it is likely failing to come back.
This has been fixed in the version 7.2.4+ and will be available soon in https://www.splunk.com/en_us/download.html
As of this writing, the Fixed versions : 7.0.9+, 7.1.6+ and 7.2.4+.

As a workaround you can also avoid the force-shutdown by increasing the parameter, splunkd_stop_timeout in server.conf which is available for the version 7.1.3+.

ii) If you are seeing this issue where it doesn't take longer than 6 mins to stop splunk, then you may want to check if you are using systemd for splunk stop/start and the version is prior to 7.2.2 which doesn't support systemd. If that's the case the unit file should use "RemainAfterExit=True" instead of False.

[Service]
RemainAfterExit=yes

By doing this it will prevent systemd from executing ExecStop so that splunkd can continue the restart procedure. After editing the file you should reload the systemd to pick up the change:

$ systemctl daemon-reload

Systemd is now supported for the versions, 7.2.2+ and then make sure to delete the line, "RemainAfterExit=true/yes". Otherwise it will cause the same issue again.

pre-7.2.2 must have RemainAfterExit = yes in the unit file.
post-7.2.2 must have the line, 'RemainAfterExit = yes' deleted.

Get Updates on the Splunk Community!

Detecting Remote Code Executions With the Splunk Threat Research Team

WATCH NOWRemote code execution (RCE) vulnerabilities pose a significant risk to organizations. If exploited, ...

Enter the Splunk Community Dashboard Challenge for Your Chance to Win!

The Splunk Community Dashboard Challenge is underway! This is your chance to showcase your skills in creating ...

.conf24 | Session Scheduler is Live!!

.conf24 is happening June 11 - 14 in Las Vegas, and we are thrilled to announce that the conference catalog ...