Solved: Why is the Search Head Cluster getting Caught in the Restart Loop?

cstump_splunk · ‎04-16-2018

Some users have reported that members of a Search Head Cluster (SHC) sometimes get caught in a restart loop after the Deployer has pushed out a new configuration bundle.
Here is what happens:
1). The Deployer pushes out a new bundle.
2). One SHC member restarts.
3). Immediately after coming back up, the SHC member restarts again.
4). This member continues to restart until the Deployer is restarted.
5). After the Deployer is bounced, the next member in the cluster gets caught in a restart loop.
6). The Deployer is restarted repeatedly as each member goes through its own restart loop.
7). The Captain is the last member to restart. Oddly enough, it does not get caught in the restart loop.
It is strange that bringing the Deployer down stops the restart loop, since restarts are handled by the Search Head Cluster Captain. Each time the SHC member comes back up (within its restart loop), the following is logged to splunkd.log:
<mm-dd-YYYY HH:MM:S.%3N -TMZ> INFO loader - Downloaded new baseline configuration; restarting ...
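To confirm that a member really is looping rather than restarting once, one option is to count how often that entry appears within a short period. The following is only a rough sketch: the timestamp layout, the threshold of three restarts, and the 30-minute window are my assumptions, and the function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Message text taken from the splunkd.log entry quoted above.
RESTART_MSG = "Downloaded new baseline configuration; restarting"

# Assumed timestamp layout, e.g. "04-16-2018 10:15:32.123 -0400".
TS_RE = re.compile(r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4})")
TS_FORMAT = "%m-%d-%Y %H:%M:%S.%f %z"


def restart_times(lines):
    """Return the timestamps of every baseline-download restart entry."""
    times = []
    for line in lines:
        if RESTART_MSG not in line:
            continue
        match = TS_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), TS_FORMAT))
    return times


def looks_like_restart_loop(lines, min_restarts=3, window=timedelta(minutes=30)):
    """True if at least min_restarts restart entries fall within one window."""
    times = sorted(restart_times(lines))
    for i in range(len(times) - min_restarts + 1):
        if times[i + min_restarts - 1] - times[i] <= window:
            return True
    return False
```

You could feed it the lines of splunkd.log from the affected member, for example `looks_like_restart_loop(open(path))`, where `path` points at that member's log file.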

cstump_splunk · ‎04-16-2018

What is most likely happening here is that each time the SHC member comes back up, it compares its bundle checksum against the Deployer's and incorrectly concludes that it does not have the current bundle.
This causes the bundle to be redeployed to this SHC member, which then results in another restart. You may find the following log entry in splunkd.log on the Search Head in question:

<mm-dd-YYYY HH:MM:S.%3N -TMZ> ERROR DistributedBundleReplicationManager - HTTP response code 409 (HTTP/1.1 409 Conflict). Checksum mismatch: received copy of bundle="/opt/splunk/var/run/searchpeers/<DeployerServer>-<bundlename>.bundle" has transferred_checksum=0 instead of checksum=<value_other_than_zero>
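If you would rather check for this entry with a script than by eye, here is a small sketch that pulls the bundle path and both checksum values out of a line shaped like the one above. The regular expression is modeled only on that quoted entry, so other Splunk versions may word it differently, and the function names are hypothetical.

```python
import re

# Pattern modeled on the ERROR entry quoted above; this is an assumption
# about the message layout, not an official format definition.
MISMATCH_RE = re.compile(
    r'Checksum mismatch: received copy of bundle="(?P<bundle>[^"]+)" '
    r"has transferred_checksum=(?P<transferred>\S+) "
    r"instead of checksum=(?P<expected>\S+)"
)


def parse_checksum_mismatch(line):
    """Return the bundle path and both checksums, or None if no match."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    return {
        "bundle": match.group("bundle"),
        "transferred": match.group("transferred"),
        "expected": match.group("expected"),
    }


def is_zero_checksum_mismatch(line):
    """True when the transferred checksum is 0 but a nonzero one was expected."""
    info = parse_checksum_mismatch(line)
    return info is not None and info["transferred"] == "0" and info["expected"] != "0"
```

Checksums are kept as strings here because the entry above does not show what a real nonzero value looks like.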

Bringing down the Deployer stops the restart loop because the checksums then match (0 compared to 0).
The most likely reason for the checksum mismatch is a problematic app installed somewhere in the cluster that prevents the checksum from being calculated properly.
You can remove apps one by one to see which of them is causing the problem, or you can prevent the restart loop from occurring in the first place.
To avoid the restart loop, we can stop the SHC member from performing the checksum when it comes back up.

In server.conf:

[shclustering]
conf_deploy_fetch_mode = none

Here is what the spec file for server.conf says about conf_deploy_fetch_mode:

conf_deploy_fetch_mode = auto|replace|none
* Controls configuration bundle fetching behavior when the member starts up.
* When set to "replace", a member checks for a new configuration bundle on every startup.
* When set to "none", a member does not fetch the configuration bundle on startup.
* Regarding "auto":
  * If no configuration bundle has yet been fetched, "auto" is equivalent to "replace".
  * If the configuration bundle has already been fetched, "auto" is equivalent to "none".
* Defaults to "replace".

As you can see, setting this value to none prevents the configuration bundle from being fetched on startup, which in turn prevents the restart loop.
Keep in mind that once you make this change, you will have to push out each config bundle manually, which is what most admins do anyhow.
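To double-check which mode a member's server.conf actually ends up with after the edit, something like the sketch below could help. It is an illustration under assumptions: it uses Python's generic configparser rather than any Splunk tooling, it reads the text of a single file and so ignores Splunk's configuration layering across default, local, and app directories, and `fetch_mode` is a name I made up. The fallback to "replace" follows the spec excerpt above.

```python
import configparser

VALID_MODES = {"auto", "replace", "none"}
DEFAULT_MODE = "replace"  # default stated in the spec excerpt above


def fetch_mode(server_conf_text):
    """Return conf_deploy_fetch_mode from the text of one server.conf file."""
    # interpolation=None keeps any '%' in values literal; strict=False
    # tolerates repeated stanzas or keys instead of raising an error.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(server_conf_text)
    mode = parser.get(
        "shclustering", "conf_deploy_fetch_mode", fallback=DEFAULT_MODE
    ).strip()
    if mode not in VALID_MODES:
        raise ValueError(f"unexpected conf_deploy_fetch_mode: {mode!r}")
    return mode
```

For example, passing the contents of the member's local server.conf should return "none" once the setting above is in place, and "replace" if the stanza or key is absent from that file.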
