Deployment Architecture

Why does our Splunk 6.3 Search Head Cluster fail only when a specific member is elected as the Captain?

Path Finder

I have built a search head cluster in our Splunk 6.3 environment. Distributed searches, app rollout, and general searches all work fine until a specific member of the SHC is elected captain.

What would cause the SHCluster to work fine, but fail when one specific host is elected captain?

The working status of the shcluster looks like this:

Captain:
dynamic_captain : 1
elected_captain : Mon Oct 26 18:47:19 2015
id : .....406DD1B10D4F
initialized_flag : 1
label : workingSH1.sh
maintenance_mode : 0
mgmt_uri : https://workingSH1.sh:8089
min_peers_joined_flag : 1
rolling_restart_flag : 0
service_ready_flag : 1

Members:
problemcaptain.sh
label : problemcaptain.sh
mgmt_uri : https://problemcaptain.sh:8089
mgmt_uri_alias : https://10.1.1.2:8089
status : Up

workingSH1.sh
label : workingSH1.sh
mgmt_uri : https://workingSH1.sh:8089
mgmt_uri_alias : https://10.1.2.1:8089
status : Up

workingSH2.sh
label : workingSH2.sh
mgmt_uri : https://workingSH2.sh:8089
mgmt_uri_alias : https://10.1.3.2:8089
status : Up

The system works fine in this state: reports, alerts, and manual searches all succeed, and deployment of new apps or changes to existing ones from the deployer works as expected.

The SHCluster fails when problemcaptain.sh is elected captain. In that state, problemcaptain.sh no longer shows up in the member list; it appears only as the captain:

Captain:
dynamic_captain : 1
elected_captain : Mon Oct 26 18:27:32 2015
id : .....406DD1B10D4F
initialized_flag : 0
label : problemcaptain.sh
maintenance_mode : 0
mgmt_uri : https://problemcaptain.sh:8089
min_peers_joined_flag : 0
rolling_restart_flag : 0
service_ready_flag : 0

Members:

workingSH1.sh
label : workingSH1.sh
mgmt_uri : https://workingSH1.sh:8089
mgmt_uri_alias : https://10.1.2.1:8089
status : Up

workingSH2.sh
label : workingSH2.sh
mgmt_uri : https://workingSH2.sh:8089
mgmt_uri_alias : https://10.1.3.2:8089
status : Up

Path Finder

Not sure why that would be, but I have found that most shcluster issues can be solved as follows:

1. Make sure the problem system is not currently the captain (restart the service if necessary).
2. On the captain, remove the problem member from the cluster:
/opt/splunk/bin/splunk remove shcluster-member -mgmt_uri https://problemcaptain.sh:8089
3. On the problem system, shut down Splunk (service splunk stop, /opt/splunk/bin/splunk stop, or whatever you use).
4. Clean its data, caches, etc.:
/opt/splunk/bin/splunk clean all
(Note: most of the configuration cleared is replicated config that will reappear once you rejoin the cluster, but the admin password is cleared as well.)
5. Start Splunk and rejoin the cluster from the captain:
/opt/splunk/bin/splunk add shcluster-member -new_member_uri https://problemcaptain.sh:8089
6. Wait a minute, then recheck shcluster-status.
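The steps above can be wrapped in a small script so they run in the right order. A minimal sketch, using the hostname from this thread; the SPLUNK_HOME path and the DRY_RUN guard are my additions — with DRY_RUN left at 1 the script only prints each command, so you can review it before running for real:

```shell
#!/bin/sh
# Sketch of the remove / clean / rejoin cycle for a misbehaving SHC member.
# SPLUNK_HOME and the member URI are assumptions -- adjust for your environment.
SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}
MEMBER_URI="https://problemcaptain.sh:8089"
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually execute the commands

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. On the captain: remove the problem member from the cluster.
run "$SPLUNK_HOME/bin/splunk" remove shcluster-member -mgmt_uri "$MEMBER_URI"

# 2. On the problem member: stop, clean replicated state (this also clears
#    the admin password), then restart.
run "$SPLUNK_HOME/bin/splunk" stop
run "$SPLUNK_HOME/bin/splunk" clean all
run "$SPLUNK_HOME/bin/splunk" start

# 3. On the captain: re-add the member, then recheck shcluster-status.
run "$SPLUNK_HOME/bin/splunk" add shcluster-member -new_member_uri "$MEMBER_URI"
```

The remove and add steps run on the captain, while stop/clean/start run on the problem member, so in practice you would split this across the two hosts.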


Path Finder

I followed the process to remove and re-add the member to the shcluster. I hadn't run the clean all command in previous attempts. I continue to see the same trouble when this host is elected captain.

I noticed this event in the splunkd.log while the host is assigned the captain role.

10-27-2015 18:18:48.839 +0000 WARN  SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/.....DE0530D0C48E captain=problemcaptain.sh:8089 rc=0 actual_response_code=401 expected_response_code=200 status_line=Unauthorized error="<response>\n  <messages>\n    <msg type="WARN">call not properly authenticated</msg>\n  </messages>\n</response>\n"

The captain looks like it is attempting to post something to itself and failing. I don't see this with the other members of the shcluster. This could be password related. Any idea what password is used for this POST?
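To check whether any other member logs the same failure, the relevant WARN lines can be filtered out of splunkd.log on each host. A minimal sketch; the default log path is an assumption, and the helper simply greps for SHPMasterHTTPProxy events that carry the 401 response code seen above:

```shell
#!/bin/sh
# Filter SHC captain-proxy authentication failures out of a splunkd.log.
# The default log path is an assumption -- adjust for your install.
LOG=${LOG:-/opt/splunk/var/log/splunk/splunkd.log}

shc_auth_failures() {
    # $1 = log file; print SHPMasterHTTPProxy lines that report a 401.
    grep "SHPMasterHTTPProxy" "$1" | grep "actual_response_code=401"
}

# Usage (requires the log file to exist, so it is commented out here):
#   shc_auth_failures "$LOG"
```

Running this on every member makes it easy to confirm the failure is specific to one host rather than cluster-wide.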

Thank you for all your help


Motivator

Seems like your cluster password isn't correct or has been corrupted somehow.

Try updating server.conf on the non-working machine with a plain-text cluster password and restart (don't worry, Splunk will re-encrypt it on restart).

For example (replace mysecretclusterpassword with your actual cluster join password):

server.conf
[shclustering]
pass4SymmKey = mysecretclusterpassword
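If editing the stanza by hand feels error-prone, the change can be scripted. A minimal sketch, with the config path and placeholder key as assumptions; the function assumes the [shclustering] stanza already contains a pass4SymmKey line (the corrupted one being replaced) and prints a rewritten copy to stdout rather than editing in place, so you can inspect the result first:

```shell
#!/bin/sh
# Replace pass4SymmKey inside the [shclustering] stanza of a server.conf copy.
# CONF path and NEWKEY are assumptions -- adjust for your system.
CONF=${CONF:-/opt/splunk/etc/system/local/server.conf}
NEWKEY=${NEWKEY:-mysecretclusterpassword}

set_pass4symmkey() {
    # $1 = conf file, $2 = new key; only rewrites the pass4SymmKey line that
    # falls inside the [shclustering] stanza, leaving every other line intact.
    awk -v key="$2" '
        /^\[/ { in_stanza = ($0 == "[shclustering]") }
        in_stanza && /^pass4SymmKey[ \t]*=/ { print "pass4SymmKey = " key; next }
        { print }
    ' "$1"
}

# Usage (writes a rewritten copy next to the original for review):
#   set_pass4symmkey "$CONF" "$NEWKEY" > "${CONF}.new"
```

After swapping the rewritten file in, restart Splunk so the plain-text key is picked up and re-encrypted.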
