When trying to set up a search head cluster (SHC), I get the following error when bringing up the cluster captain:
"In handler 'shclustermemberconsensus': Failed to Set Configuration. One potential reason is captain could not hear back from all the nodes in a timeout period. Ensure all to be added nodes are up, and increase the raft timeout. If all nodes are up and running, look at splunkd.log for appendEntries errors due to mgmt_uri mismatch"
Looking at splunkd.log, I see the following errors and warnings:
"01-20-2015 11:18:15.871 +0000 WARN HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:16.107 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:16.107 +0000 WARN SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
01-20-2015 11:18:18.991 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code. Only got 0.
01-20-2015 11:18:20.681 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code. Only got 0.
01-20-2015 11:18:20.681 +0000 ERROR SHPRaftConsensus - failed appendEntriesRequest err: Failure, rc=6: Unknown read error to http://mysystem4:8089
01-20-2015 11:18:20.681 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code. Only got 0.
01-20-2015 11:18:20.682 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code. Only got 0.
01-20-2015 11:18:20.682 +0000 ERROR SHPRaftConsensus - failed appendEntriesRequest err: Failure, rc=6: Unknown read error to http://mysystem3:8089
01-20-2015 11:18:21.113 +0000 WARN HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:21.114 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:21.114 +0000 WARN SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
01-20-2015 11:18:26.120 +0000 WARN HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:26.121 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:26.121 +0000 WARN SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
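One way to cross-check a log like this is to count how often the listener complains about cleartext traffic and to list which URL schemes the node is using for its peers. The sketch below is only a reading aid, not a confirmed diagnosis. It relies on one fact about OpenSSL: the reason string `SSL23_GET_CLIENT_HELLO:http request` is reported when a TLS listener receives a plain HTTP request. The function name `summarize` and the regular expression are illustrative assumptions, not part of Splunk.

```python
import re

# OpenSSL reports this reason when a TLS listener receives cleartext HTTP.
PLAINTEXT_ON_TLS = "SSL23_GET_CLIENT_HELLO:http request"

# Matches scheme, host and port in URLs such as http://mysystem2:8089/...
URL_PATTERN = re.compile(r"\b(https?)://([\w.-]+):(\d+)")


def summarize(log_lines):
    """Count plaintext-on-TLS warnings and collect the URLs the node tried to reach."""
    plaintext_hits = 0
    targets = set()
    for line in log_lines:
        if PLAINTEXT_ON_TLS in line:
            plaintext_hits += 1
        for scheme, host, port in URL_PATTERN.findall(line):
            targets.add((scheme, host, int(port)))
    return plaintext_hits, sorted(targets)
```

Feeding the lines above to `summarize` would show whether every outgoing request uses `http://` while the local listener keeps reporting the plaintext warning, which is worth comparing against the scheme configured for the management URI.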
Systems: all are Linux, CentOS 6.5
Splunk version 6.2 from package splunk-6.2.1-245427-Linux-x86_64.gz
My initial suspects were DNS and FQDN resolution, but both are working.
Because it looked as if ports were being blocked, I also disabled both iptables and SELinux, but that turned out not to be the cause.
In server.conf, I changed the following settings to test the timeouts:
cxn_timeout_raft = 50
send_timeout_raft = 50
rcv_timeout_raft = 50
rep_cxn_timeout = 50
heartbeat_timeout = 120
In addition, on the same system where I try to bring up the captain, I ran wget against the URL that appears in the log, "http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B", and got the same "connection reset by peer" error:
"Connecting to mysystem2|127.0.0.1|:8089... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying."
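The wget test only tries cleartext HTTP. A slightly broader check is to probe the same port twice, once with a plain HTTP request and once with a TLS handshake, to see which protocol the listener actually answers. The following is a minimal sketch using only the Python standard library; the function names `speaks_plain_http` and `speaks_tls` are hypothetical helpers, and certificate verification is deliberately switched off because the goal is only to detect the protocol, not to validate trust.

```python
import socket
import ssl


def speaks_plain_http(host, port, timeout=5.0):
    """Return True if the port answers a cleartext HTTP request with an HTTP status line."""
    request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            reply = sock.recv(16)
    except OSError:
        # Covers "Connection reset by peer", refused connections and timeouts.
        return False
    return reply.startswith(b"HTTP/")


def speaks_tls(host, port, timeout=5.0):
    """Return True if a TLS handshake completes on the port."""
    context = ssl.create_default_context()
    # Verification is disabled on purpose: this only tests the protocol.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError is a subclass of OSError, so handshake failures land here too.
        return False
```

Calling `speaks_plain_http("mysystem2", 8089)` and `speaks_tls("mysystem2", 8089)` on the captain candidate would indicate whether port 8089 expects TLS. If only the TLS probe succeeds, the reset seen with wget over `http://` would be consistent with a scheme mismatch rather than a blocked port.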
It seems the peer itself is resetting the connections. All the other nodes behave the same way if I try to bring them up as captain.
Running netstat -ant shows that the management port 8089 is open.
I also looked into this part of the log:
actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
A 502 error means bad gateway, but these machines are all on a local network.
Any ideas? Has anyone faced this situation?