
Why is the Search Head Cluster getting caught in a restart loop?

Splunk Employee

Some users have reported that members of a Search Head Cluster (SHC) sometimes get caught in a restart loop after the Deployer has pushed out a new configuration bundle.
Here is what happens:
1). The Deployer pushes out a new bundle
2). One SHC member restarts
3). Immediately after coming back up, the SHC member restarts again.
4). This member continues to restart until the Deployer is restarted.
5). After the Deployer is bounced, the next member in the cluster gets caught in a restart loop.
6). The Deployer is restarted repeatedly as each member goes through its own restart loop.
7). The Captain is the last member to restart. Oddly enough, it does not get caught in the restart loop.
It is strange that bringing the Deployer down stops the restart loop, since restarts are handled by the Search Head Cluster Captain. Each time the SHC member comes back up (within its restart loop), the following is logged in splunkd.log:
<mm-dd-YYYY HH:MM:S.%3N -TMZ> INFO loader - Downloaded new baseline configuration; restarting ...
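To confirm that a member is looping rather than restarting once, one option is to count how often this message appears in splunkd.log. The following is only a minimal sketch: the function names and the threshold of 3 are hypothetical choices for illustration, not anything Splunk defines.

```python
# The message text comes from the splunkd.log entry quoted above.
RESTART_MARKER = "Downloaded new baseline configuration; restarting"


def count_baseline_restarts(log_lines):
    """Count log lines that record a restart triggered by a new baseline bundle."""
    return sum(1 for line in log_lines if RESTART_MARKER in line)


def looks_like_restart_loop(log_lines, threshold=3):
    """Return True if the marker appears at least `threshold` times.

    `threshold` is an arbitrary, assumed cut-off; tune it for your environment.
    """
    return count_baseline_restarts(log_lines) >= threshold
```

You could pass it the lines of splunkd.log from the affected member, for example by opening the file and handing the file object to `looks_like_restart_loop`.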

1 Solution

Splunk Employee

What is most likely happening here is that each time the SHC member comes back up, it compares its bundle checksum against the Deployer's and incorrectly concludes that it does not have the current bundle.
This causes the bundle to be redeployed to that SHC member, which then results in another restart. You may find the following log entry in splunkd.log on the Search Head in question:

<mm-dd-YYYY HH:MM:S.%3N -TMZ> ERROR DistributedBundleReplicationManager - HTTP response code 409 (HTTP/1.1 409 Conflict). Checksum mismatch: received copy of bundle="/opt/splunk/var/run/searchpeers/<DeployerServer>-<bundlename>.bundle" has transferred_checksum=0 instead of checksum=<value_other_than_zero>
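If you want to pull the two checksum values out of entries like this one, a small regular expression is enough. This is a sketch that assumes the message keeps the exact wording shown above; the function name is hypothetical.

```python
import re

# Matches the tail of the "Checksum mismatch" message quoted above.
CHECKSUM_RE = re.compile(r"transferred_checksum=(\S+) instead of checksum=(\S+)")


def parse_checksum_mismatch(line):
    """Return (transferred_checksum, expected_checksum) as strings, or None."""
    match = CHECKSUM_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A transferred checksum of 0 next to a non-zero expected checksum is the pattern described in this answer.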

Bringing down the Deployer stops the restart loop because the checksums then match (0 compared to 0).
The most likely reason for the checksum mismatch is a problematic app installed somewhere in the cluster that prevents the checksum from being calculated properly.
You can remove apps one by one to see which of them is causing the problem, or you can prevent the restart loop from occurring at all.
To avoid the restart loop, we can stop the SHC member from performing the checksum when it comes back up.

server.conf
[shclustering]
conf_deploy_fetch_mode = none

Here is what the spec file for server.conf says about conf_deploy_fetch_mode:

conf_deploy_fetch_mode = auto|replace|none
* Controls configuration bundle fetching behavior when the member starts up.
* When set to "replace", a member checks for a new configuration bundle on
  every startup.
* When set to "none", a member does not fetch the configuration bundle on
  startup.
* Regarding "auto":
  * If no configuration bundle has yet been fetched, "auto" is equivalent
    to "replace".
  * If the configuration bundle has already been fetched, "auto" is
    equivalent to "none".
* Defaults to "replace".
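The three modes in the spec excerpt can be read as a simple decision rule. The sketch below only models that rule as described in the spec text; it is not Splunk's implementation, and the function and parameter names are hypothetical.

```python
def checks_for_bundle_on_startup(mode, bundle_already_fetched):
    """Model of conf_deploy_fetch_mode, following the spec excerpt above.

    mode: "auto", "replace", or "none".
    bundle_already_fetched: whether the member has fetched a bundle before.
    """
    if mode == "replace":
        return True   # checks for a new bundle on every startup
    if mode == "none":
        return False  # never fetches the bundle on startup
    if mode == "auto":
        # "auto" acts like "replace" until a bundle has been fetched,
        # and like "none" afterwards.
        return not bundle_already_fetched
    raise ValueError(f"unknown conf_deploy_fetch_mode: {mode!r}")
```

With "none", the startup check that triggers the loop never runs, whichever state the member is in.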

As you can see, setting this value to none prevents the member from fetching the configuration bundle on startup, which in turn prevents the restart loop.
Keep in mind that once you make this change, you will have to push out each config bundle manually, which is what most admins do anyhow.
