Hi all,
Upon provisioning a new environment (where we don’t have SSH access), we’ve noticed two errors in the Messages section of Splunk Web:
Error pulling configurations from the search head cluster captain (https://xx.x.xx.19:8089); consider performing a destructive configuration resync on this search head cluster member (on instances xx.x.xx.21, xx.x.xx.20)
The search head cluster captain (https://xx.x.xx.21:8089) is disconnected; skipping configuration replication (on xx.x.xx.19)
In the splunkd.log we see another error:
12-11-2015 15:17:01.177 +0000 ERROR SHPRepJob - failed job=SHPDelegateSearchJob guid=xxxxxxxxxxxxxxxxxxxxxx hp=xx.x.xx.20:8089 saved_search=system;; err=error accessing https://xx.x.57.20:8089/servicesNS/admin/splunk_management_console/shcluster/member/delegatejob/DMC%..., statusCode=404, description=Not Found
We’ve recreated the search head cluster with new instances several times. Sometimes the first error occurs on only one instance and the second error doesn’t occur at all. The errors appear almost immediately after the third search head finishes provisioning.
As we don’t have SSH access, we first tried to debug this by creating several dev environments with SSH access, using the same provisioning templates and scripts. In these dev environments we don’t see these errors at all; the shcluster looks fine in both the CLI and Splunk Web. There is no difference between the environments except for SSH access.
According to the Splunk documentation (http://docs.splunk.com/Documentation/Splunk/6.2.0/DistSearch/Handlememberfailure), we can normally solve this with the Splunk CLI command splunk resync shcluster-replicated-config. There seems to be no Splunk Web UI alternative for this. The documentation mentions that the first error occurs when:
“Upon rejoining the cluster, the member attempts to apply the set of intervening replicated changes from the captain. If the set exceeds the purge limits and the member and captain no longer share a common commit, a banner message appears on the member's UI”
In our case there is no member trying to rejoin the cluster; all three search heads are fresh installs. We temporarily turned SSH access on to figure out the root cause of this problem. When we check shcluster-status, all three search heads show the same status.
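The output below is from running the following on each member (admin credentials assumed; the password is a placeholder):
./splunk show shcluster-status -auth admin:<password>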
Captain:
dynamic_captain : 1
elected_captain : Fri Dec 11 13:36:28 2015
id : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
initialized_flag : 1
label : ip-xx-x-xx-19
maintenance_mode : 0
mgmt_uri : https://xx.x.xx.19:8089
min_peers_joined_flag : 1
rolling_restart_flag : 0
service_ready_flag : 1
Members:
ip-xx-x-xx-20
label : ip-xx-x-xx-20
mgmt_uri : https://xx.x.xx.20:8089
mgmt_uri_alias : https://xx.x.xx.20:8089
status : Up
ip-xx-x-xx-19
label : ip-xx-x-xx-19
mgmt_uri : https://xx.x.xx.19:8089
mgmt_uri_alias : https://xx.x.xx.19:8089
status : Up
ip-xx-x-xx-21
label : ip-xx-x-xx-21
mgmt_uri : https://xx.x.xx.21:8089
mgmt_uri_alias : https://xx.x.xx.21:8089
status : Up
I don’t understand how and why the cluster is out of sync.
NTP is enabled and the search heads all use the same pass4SymmKey.
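For completeness, the [shclustering] stanza in server.conf on each member looks roughly like this (a sketch with placeholder values, not our exact config):
[shclustering]
disabled = 0
mgmt_uri = https://xx.x.xx.19:8089
pass4SymmKey = <same key on all three members>
conf_deploy_fetch_url = https://<deployer>:8089
[replication_port://9887]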
Thanks!
kimche
The endpoint for a destructive resync should be /services/replication/configuration/commits
If you notice the baseline is out of sync in splunkd.log, or if you see the banner message telling you to run a destructive resync, the following command should work (note the resync_destructive argument, which per the splunkrc_cmds.xml mapping below is what the CLI passes):
curl -k -u admin:splunker https://127.0.0.1:8089/services/replication/configuration/commits -d resync_destructive=1
This is basically the equivalent of:
./splunk resync shcluster-replicated-config
Based on $SPLUNK_HOME/etc/system/static/splunkrc_cmds.xml:
<item obj="shcluster-replicated-config">
  <cmd name="resync">
    <uri><![CDATA[/replication/configuration/commits]]></uri>
    <default>
      <arg name="resync_destructive" value="1" />
    </default>
    <help>
      <title><![CDATA[Destructively resyncs this node to the latest replicated config on the captain.]]></title>
      <examples>
        <ex><![CDATA['./splunk resync shcluster-replicated-config]]></ex>
      </examples>
    </help>
    <type>edit</type>
  </cmd>
</item>
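If you only want to inspect the current replicated-config baseline before doing anything destructive, a plain GET against the same endpoint should list the commits on that member (a sketch; host and credentials are placeholders):
curl -k -u admin:changeme https://127.0.0.1:8089/services/replication/configuration/commits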
Specifically, I don't understand why the cluster is trying to replicate a DMC asset that is supposed to be local.
You can try some of the REST endpoints to see if they will "resync". I looked through the code to find out what resync does, but didn't have any luck.
https://<host>:<mgmt_port>/services/shcluster/
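For example, you can query the standard SHC status endpoints on a member (a sketch; host and admin credentials are placeholders):
curl -k -u admin:changeme https://<host>:<mgmt_port>/services/shcluster/status
curl -k -u admin:changeme https://<host>:<mgmt_port>/services/shcluster/captain/members
curl -k -u admin:changeme https://<host>:<mgmt_port>/services/shcluster/member/info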