Why do I have this search head cluster synchronization problem?

kimche
Path Finder

Hi all,

Upon provisioning a new environment (where we don’t have SSH access), we’ve noticed 2 errors in the messages section of Splunk Web:

 Error pulling configurations from the search head cluster captain (https://xx.x.xx.19:8089); consider performing a destructive configuration resync on this search head cluster member (on instances xx.x.xx.21, xx.x.xx.20)
 The search head cluster captain (https://xx.x.xx.21:8089) is disconnected; skipping configuration replication (on xx.x.xx.19)

In the splunkd.log we see another error:

12-11-2015 15:17:01.177 +0000 ERROR SHPRepJob - failed job=SHPDelegateSearchJob guid=xxxxxxxxxxxxxxxxxxxxxx hp=xx.x.xx.20:8089 saved_search=system;; err=error accessing https://xx.x.57.20:8089/servicesNS/admin/splunk_management_console/shcluster/member/delegatejob/DMC%..., statusCode=404, description=Not Found 

We’ve recreated the search head cluster with new instances several times. Sometimes the first error occurs on only one instance and the second error doesn’t occur at all. The errors appear almost immediately after the third search head finishes provisioning.

As we don’t have SSH access, we first tried to debug this by creating several dev environments with SSH access, using the same provisioning templates and scripts. In these dev environments, the errors don’t occur at all. The shcluster looks fine in both the CLI and Splunk Web. There is no difference between the environments except for SSH access.

According to the Splunk documentation (http://docs.splunk.com/Documentation/Splunk/6.2.0/DistSearch/Handlememberfailure), we can normally solve this with the Splunk CLI command splunk resync shcluster-replicated-config. There seems to be no Splunk Web alternative for this. The documentation says the first error occurs when:

“Upon rejoining the cluster, the member attempts to apply the set of intervening replicated changes from the captain. If the set exceeds the purge limits and the member and captain no longer share a common commit, a banner message appears on the member's UI” 
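
For reference, the resync command from the docs is run locally on the affected member; on our instances (assuming a default install path) it would look like this:

 cd /opt/splunk/bin
 ./splunk resync shcluster-replicated-config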

In our case there’s no member trying to rejoin the cluster. All three search heads are fresh installs. We temporarily turned SSH access on to figure out the root cause of this problem. When we check the cluster status, all three search heads report the same shcluster-status:

Captain:
                          dynamic_captain : 1
                          elected_captain : Fri Dec 11 13:36:28 2015
                                       id : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                         initialized_flag : 1
                                    label : ip-xx-x-xx-19
                         maintenance_mode : 0
                                 mgmt_uri : https://xx.x.xx.19:8089
                    min_peers_joined_flag : 1
                     rolling_restart_flag : 0
                       service_ready_flag : 1

Members:
        ip-xx-x-xx-20
                                    label : ip-xx-x-xx-20
                                 mgmt_uri : https://xx.x.xx.20:8089
                           mgmt_uri_alias : https://xx.x.xx.20:8089
                                   status : Up
        ip-xx-x-xx-19
                                    label : ip-xx-x-xx-19
                                 mgmt_uri : https://xx.x.xx.19:8089
                           mgmt_uri_alias : https://xx.x.xx.19:8089
                                   status : Up
        ip-xx-x-xx-21
                                    label : ip-xx-x-xx-21
                                 mgmt_uri : https://xx.x.xx.21:8089
                           mgmt_uri_alias : https://xx.x.xx.21:8089
                                   status : Up
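
For reference, the output above comes from the standard status command, run locally on each node (install path and credentials shown here as placeholders):

 /opt/splunk/bin/splunk show shcluster-status -auth admin:<password>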

I don’t understand how and why the cluster is out of sync.
NTP is enabled and the search heads all use the same pass4symmkey.
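
For what it’s worth, the [shclustering] stanza in server.conf looks roughly like this on every member (values below are placeholders; the real pass4SymmKey and label are identical on all three nodes, and mgmt_uri points at the member itself):

 [shclustering]
 pass4SymmKey = <same secret on all members>
 shcluster_label = shcluster1
 mgmt_uri = https://xx.x.xx.19:8089
 replication_factor = 3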

Thanks!

kimche

ben_leung
Builder

The endpoint for a destructive resync should be /services/replication/configuration/commits

If you notice in splunkd.log that the baseline is out of sync, or if you see the banner message telling you to run a destructive resync, run the following command (it should work):

curl -k -u admin:splunker https://127.0.0.1:8089/services/replication/configuration/commits

It is basically the equivalent of:

./splunk resync shcluster-replicated-config

This is based on $SPLUNK_HOME/etc/system/static/splunkrc_cmds.xml:

    <item obj="shcluster-replicated-config">
        <cmd name="resync">
            <uri><![CDATA[/replication/configuration/commits]]></uri>
            <default>
                <arg name="resync_destructive" value="1" />
            </default>
            <help>
                <title><![CDATA[Destructively resyncs this node to the latest replicated config on the captain.]]></title>
                <examples>
                    <ex><![CDATA[./splunk resync shcluster-replicated-config]]></ex>
                </examples>
            </help>
            <type>edit</type>
        </cmd>
    </item>
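
Since the XML marks resync as an edit with resync_destructive=1 as the default argument, the full REST call is presumably a POST carrying that parameter; something like this (an untested sketch, with the same placeholder credentials as above):

curl -k -u admin:splunker -X POST https://127.0.0.1:8089/services/replication/configuration/commits -d resync_destructive=1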

greich
Communicator

Specifically, I don't understand why the cluster is trying to replicate a DMC asset that is supposed to stay local.


jkat54
SplunkTrust

You can try some of the REST endpoints to see if they will "resync". I looked through the code to find out what resync does, but didn't have any luck.

https://<host>:<mgmt_port>/services/shcluster/
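
For example, these are endpoints worth poking at (host, port, and credentials are placeholders; they report cluster state and won't necessarily trigger a resync):

https://<host>:<mgmt_port>/services/shcluster/status
https://<host>:<mgmt_port>/services/shcluster/member/info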