Hi guys,
I have an issue with my Search Head cluster: the replication does not seem to be working.
192.168.192.131 is SearchHead1
192.168.192.136 is SearchHead2
11-15-2014 12:42:32.993 +0100 WARN ConfReplicationThread - Error pulling configurations from captain=https://192.168.192.131:8089, consecutiveErrors=966: Error in fetchFrom, at=: Non-200 status_code=500: refuse request without valid baseline; snapshot exists at op_id=1a4a26781bed0c57c325b1fd297fb07082eba435 for repo=https://192.168.192.131:8089
11-15-2014 12:42:32.990 +0100 ERROR HttpListener - Exception while processing request from 192.168.192.136 for /services/replication/configuration/commits?output_mode=json&at=: refuse request without valid baseline; snapshot exists at op_id=1a4a26781bed0c57c325b1fd297fb07082eba435 for repo=https://192.168.192.131:8089
The captain feature is working: if I stop the captain, the other Search Head becomes the captain (according to the command "splunk show shcluster-status").
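For reference, this is roughly how I check the cluster state on each member (assuming a default /opt/splunk install path; adjust to your own):

/opt/splunk/bin/splunk show shcluster-status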
Here is my server.conf on the Search Heads:
[shclustering]
conf_deploy_fetch_url = https://192.168.192.134:8089 # DEPLOYER URL
disabled = 0
mgmt_uri = https://192.168.192.136:8089 # IP OF CURRENT SERVER
pass4SymmKey = $1$oov1Lgj65W5z
replication_factor = 2
id = 6EFA87CF-8D4D-43D5-85D3-DE8BAD78403E
Does anyone see where my problem is?
Found the problem,
http://docs.splunk.com/Documentation/Splunk/6.2.0/DistSearch/Handlememberfailure
splunk resync shcluster-replicated-config
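In case it helps anyone else, here is roughly what I ran on the member that had fallen behind (SearchHead2, 192.168.192.136 in my setup; /opt/splunk is an assumed install path):

/opt/splunk/bin/splunk resync shcluster-replicated-config

As the docs caution, this overwrites the member's replicated search-related configuration, so run it only on the member that is out of sync.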
Some further update for errors like the one below:
08-01-2017 10:03:37.694 -0700 WARN ConfReplicationThread - Error pulling configurations from captain=https://:8089, consecutiveErrors=2 msg="Error in fetchFrom, at=ae823222d0607652969d338bb793469fb7de85cd: Network-layer error: Connect Timeout
Please note that a consecutiveErrors count of 10 or less is not considered a real issue. It can happen because the captain side is busy and not able to respond in time.
Check what the consecutiveErrors count is for you, using a search like:
index=_internal (host= OR host= OR host= OR host=) source="splunkd.log" "ConfReplicationThread - Error pulling configurations from captain" | stats max(consecutiveErrors) by host
It's not an issue if consecutiveErrors < 10. If the count goes above 10, log a case with Splunk Support.
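For illustration only, with hypothetical host names sh1 and sh2 substituted for your actual search head members, the search would look like this:

index=_internal (host=sh1 OR host=sh2) source="splunkd.log" "ConfReplicationThread - Error pulling configurations from captain" | stats max(consecutiveErrors) by host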
Normally this error means that a Search Head Cluster member has fallen behind in replication - I think it is a good idea to debug why configurations aren't syncing in the first place and address the root cause.
A destructive resync is only truly required if the member has fallen really far behind the captain -- i.e. 20000 changes behind (by default) -- or if local state is completely corrupted/invalid (e.g. corrupt filesystem).
For Search Head Clustering, please refer to the answers below to ensure that the Search Head Cluster members are configured as per requirements.
I downvoted this post because it doesn't fix the underlying issue (i.e. identify the cause of the replication bottleneck); it just temporarily works around it.
Ok. But in the docs it states:
"Caution: This command causes an overwrite of the member's entire set of search-related configurations, resulting in the loss of any local changes."
What does "loss of local changes" mean with this? Any changes that have been made are lost? For all time? For the last hour?
It means any changes that you have made to that search head alone, as opposed to those changes that get propagated (through either the deployer or automatic replication) across the set of cluster members.
I guess you lose all changes since the last replication.
Without replication to other Search Head members, your changes are local.
This is how I understand it.