Hello,
Today I modified the /etc/system/local/authentication.conf file on all Search Head Cluster members, because most settings should be pushed by the Deployer in a separate app. Authentication is still working fine (local and LDAP)...
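For reference, the plan is to ship the file from the Deployer in a dedicated app, placed roughly like this on the Deployer (the app name here is just a placeholder I picked):

/opt/splunk/etc/shcluster/apps/org_shc_authentication/local/authentication.conf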
Now, when I run /opt/splunk/bin/splunk apply shcluster-bundle ...
I get the following error:
Error while deploying apps to target=https://name.xyz:8089 with members=2: no captain found amongst members
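For completeness, the command I run on the Deployer looks roughly like this (credentials are a placeholder):

/opt/splunk/bin/splunk apply shcluster-bundle -target https://name.xyz:8089 -auth admin:<password>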
The internal log is as follows:
127.0.0.1 - admin [02/Mar/2016:11:10:35.265 +0000] "POST /services/apps/deploy HTTP/1.1" 500 245 - - - 10529ms
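I found that entry with a search along these lines (field names assume the default splunkd_access extractions):

index=_internal sourcetype=splunkd_access uri_path="/services/apps/deploy" status=500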
But the SHC looks fine in the Distributed Management Console, and I get the following output when checking the cluster status on the CLI:
name1# /opt/splunk/bin/splunk show shcluster-status
Captain:
dynamic_captain : 1
elected_captain : Wed Mar 2 10:48:04 2016
id : B2542A43-0D49-4235-ABAA-6749581BA6DC
initialized_flag : 1
label : name1
maintenance_mode : 0
mgmt_uri : https://name1.xyz:8089
min_peers_joined_flag : 1
rolling_restart_flag : 0
service_ready_flag : 1
Members:
name2
label : name2
mgmt_uri : https://name2.xyz:8089
mgmt_uri_alias : https://1.1.1.2:8089
status : Up
name3
label : name3
mgmt_uri : https://name3.xyz:8089
mgmt_uri_alias : https://1.1.1.3:8089
status : Up
Thanks,
/Rainer
Hi Rainer, based on your show shcluster-status output, it looks like you are getting this message because the captain is actually not a member of the cluster. While name2 and name3 are present in the members list, name1 is not.
Additionally, I would specifically target the captain when running the apply shcluster-bundle command, i.e. name1 instead of name.
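Something like this, run from the Deployer (password is a placeholder):

/opt/splunk/bin/splunk apply shcluster-bundle -target https://name1.xyz:8089 -auth admin:<password>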
I would try a restart of name1 and see if that prompts a re-election; hopefully name1 will then join the cluster successfully. Otherwise it looks like you might have a deeper problem with the SHC that would require some assistance from Support.
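That is, on name1 (and then verify from any member):

/opt/splunk/bin/splunk restart
/opt/splunk/bin/splunk show shcluster-status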
Please let me know if this helps!
I recently ran into the same issue: the captain was elected but missing from the member list, and it no longer responded to the other members. A reboot helped, but not for long; the cluster became unstable again pretty quickly. I started digging deeper and found the dispatch directory filling up (150k+ directories) while the reaper didn't clean up, so I/O went up like crazy. I identified a real-time scheduled search that caused Splunk (6.5.5) to keep all the rt_scheduler__nobody* directories. A rewrite of the search fixed it. After cleaning the dispatch directory, the cluster was running fine again. I'm afraid I spotted a possible bug in this version.
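For anyone hitting the same thing, a quick way to check whether dispatch is the culprit (paths assume a default /opt/splunk install):

ls /opt/splunk/var/run/splunk/dispatch | wc -l
ls -d /opt/splunk/var/run/splunk/dispatch/rt_scheduler__nobody* 2>/dev/null | wc -l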
I did a reboot of the complete box (a splunk restart was not enough) and a new captain was elected. I now see all three nodes as cluster members. Thank you for the hint!
Awesome, glad to hear! 😄
How many search heads do you have in total (including the captain)? Is splunkd up on all of them?
Also, when you push authentication.conf, I am assuming you have the LDAP strategy with the bind password on each and every search head as well, correct? Sorry if I misread; the reason I ask is that you cannot push one copy of an LDAP strategy from the Deployer where the password is already encrypted. It happened to me once when I was new to SHC.
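In other words, each search head needs a strategy stanza whose password starts out in plaintext on that host, because splunkd encrypts it per-host on restart. A sketch with placeholder names and DNs:

[authentication]
authType = LDAP
authSettings = corpLdap

[corpLdap]
host = ldap.example.com
port = 636
SSLEnabled = 1
bindDN = cn=splunk-bind,ou=services,dc=example,dc=com
# plaintext on first deploy; splunkd encrypts this per-host on restart
bindDNpassword = changeme
userBaseDN = ou=people,dc=example,dc=com
userNameAttribute = uid
realNameAttribute = cn
groupBaseDN = ou=groups,dc=example,dc=com
groupNameAttribute = cn
groupMemberAttribute = member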
And like @harsmarvania57 mentioned, name1 should appear in the members list as well.
Assuming you are on the latest build, have you tried this?
http://docs.splunk.com/Documentation/Splunk/6.3.3/DistSearch/Staticcaptain
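If I remember the doc correctly, the gist is that election is disabled and one member is pinned as captain. On the member that should become captain:

/opt/splunk/bin/splunk edit shcluster-config -mode captain -captain_uri https://name1.xyz:8089 -election false

And on every other member:

/opt/splunk/bin/splunk edit shcluster-config -mode member -captain_uri https://name1.xyz:8089 -election false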
Thanks,
Raghav
There are 3 members total in the cluster and splunkd is up and running on all of them. The LDAP config seems to be OK on all devices, since I am able to log in with the LDAP account when accessing the nodes directly.
I'll try the static captain thing...
I tried the static captain configuration on the dynamic captain and got the following output:
In handler 'shclusterconfig': Could not contact captain. Check that the captain is up, the captain_uri=https://name1:8089 and secret are specified correctly Err : Failure, rc=2: Connect to=https://name1:8089 timed out; exceeded 30sec LowerLevelErrors = SocketError connecting to=name1:8089 WARN: Connect to=name1:8089 timed out; exceeded 30sec
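A quick way to sanity-check that the management port is reachable at all (standard curl; credentials are placeholders):

curl -k -u admin:<password> https://name1.xyz:8089/services/server/info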
It is extremely strange, but after rebooting the complete box (a splunkd restart was not enough), a new captain was elected and now everything is fine...
Can you please check why the members list shows only "name2" and "name3"? "name1" must be in the members list as well.
name1 is not in the members list. How can I check why?
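Is there a way to query this directly? I was thinking of asking the captain for its member list via REST, along these lines (placeholder credentials):

curl -k -u admin:<password> https://name1.xyz:8089/services/shcluster/captain/members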