Deployment Architecture

Search Head Cluster: Why am I getting "Connection reset by peer" and 502 errors trying to bring up the cluster captain?

fernandoandre
Communicator

When trying to set up a search head cluster (SHC), I get the following error when bringing up the cluster captain:

"In handler 'shclustermemberconsensus': Failed to Set Configuration. One potential reason is captain could not hear back from all the nodes in a timeout period. Ensure all to be added nodes are up, and increase the raft timeout. If all nodes are up and running, look at splunkd.log for appendEntries errors due to mgmt_uri mismatch"

splunkd.log shows the following errors and warnings:

"01-20-2015 11:18:15.871 +0000 WARN  HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:16.107 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:16.107 +0000 WARN  SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
01-20-2015 11:18:18.991 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code.  Only got 0.
01-20-2015 11:18:20.681 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code.  Only got 0.
01-20-2015 11:18:20.681 +0000 ERROR SHPRaftConsensus - failed appendEntriesRequest err: Failure, rc=6: Unknown read error to http://mysystem4:8089
01-20-2015 11:18:20.681 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code.  Only got 0.
01-20-2015 11:18:20.682 +0000 ERROR HTTPClient - Should have gotten at least 3 tokens in status line, while getting response code.  Only got 0.
01-20-2015 11:18:20.682 +0000 ERROR SHPRaftConsensus - failed appendEntriesRequest err: Failure, rc=6: Unknown read error to http://mysystem3:8089
01-20-2015 11:18:21.113 +0000 WARN  HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:21.114 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:21.114 +0000 WARN  SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
01-20-2015 11:18:26.120 +0000 WARN  HttpListener - Socket error from 127.0.0.1 while idling: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request
01-20-2015 11:18:26.121 +0000 ERROR HttpClientRequest - HTTP client error: Connection reset by peer (while accessing http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B)
01-20-2015 11:18:26.121 +0000 WARN  SHPMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B captain=mysystem2:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"
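
One thing worth noting in the excerpt above: "SSL23_GET_CLIENT_HELLO:http request" is the error OpenSSL reports when a plain HTTP request reaches a listener that expects a TLS handshake, and every failing URI in the log starts with http://. The following is a minimal Python sketch, assuming only the message formats shown in this excerpt, that counts both symptoms in splunkd.log lines. The function name and the counter key are made up for illustration, and the patterns are not an exhaustive list of splunkd messages.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt in this thread (assumption: other
# Splunk versions may word these messages differently).
SSL_GOT_PLAIN_HTTP = re.compile(r"SSL23_GET_CLIENT_HELLO:http request")
PLAIN_HTTP_MGMT_URI = re.compile(r"http://[\w.\-]+:8089")


def summarize(lines):
    """Count plain-HTTP symptoms in an iterable of splunkd.log lines."""
    counts = Counter()
    for line in lines:
        if SSL_GOT_PLAIN_HTTP.search(line):
            # The listener expected TLS but received a plain HTTP request.
            counts["plain_http_hit_ssl_listener"] += 1
        for uri in PLAIN_HTTP_MGMT_URI.findall(line):
            # A client on this host used http:// against management port 8089.
            counts[uri] += 1
    return counts


sample = [
    "01-20-2015 11:18:15.871 +0000 WARN  HttpListener - Socket error from "
    "127.0.0.1 while idling: error:1407609C:SSL routines:"
    "SSL23_GET_CLIENT_HELLO:http request",
    "01-20-2015 11:18:20.681 +0000 ERROR SHPRaftConsensus - failed "
    "appendEntriesRequest err: Failure, rc=6: Unknown read error to "
    "http://mysystem4:8089",
]
result = summarize(sample)
print(dict(result))
```

To run it against a real file, pass an open file handle instead of the sample list, for example summarize(open(path)) with path pointing at your own splunkd.log.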

Systems: all run CentOS 6.5 (Linux)
Splunk version: 6.2.1, from package splunk-6.2.1-245427-Linux-x86_64.gz
My first suspects were DNS and FQDN resolution, but both are working.
I also disabled both iptables and SELinux because it looked as if ports were being blocked, but that turned out not to be the case.
In server.conf, I changed the following settings to test the timeouts:

cxn_timeout_raft = 50
send_timeout_raft = 50
rcv_timeout_raft = 50
rep_cxn_timeout = 50
heartbeat_timeout = 120

In addition, on the same system where I try to bring up the captain, I ran wget against the URL that appears in the log, "http://mysystem2:8089/services/shcluster/captain/members/8B59C0E4-9436-427D-B338-361DC02A3B6B", and got the same "connection reset by peer" error:

"Connecting to mysystem2|127.0.0.1|:8089... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying."

It seems the peer itself is resetting the connections. All the other members do the same if I try to bring them up as captain.
Running netstat -ant shows that management port 8089 is open.

I also looked into this part of the warning:

actual_response_code=502 expected_response_code=200 status_line=Connection reset by peer error="Connection reset by peer"

A 502 response means Bad Gateway, but these machines are all on a local network.

Any ideas? Has anyone faced this situation?

sborys
Explorer

So I got my cluster working.

While following the Splunk documentation, I noticed that its examples for initializing search head cluster members all use https.

Since I knew I wasn't using SSL, I had entered http in all of my initialization commands and config files.

Splunk, however, uses its own certificates, and that's why you should use https for all of your URIs. I also used plain IP addresses instead of domain names.
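
To find leftover http:// values, a small script can scan the config text. This is a minimal Python sketch, assuming the usual .conf layout of [stanza] headers followed by key = value lines. The function name is hypothetical and the example config is invented for illustration; mgmt_uri is the setting named in the error message in the question.

```python
import re

# Matches "key = http://..." lines. Lines starting with "#" do not match
# because "#" is not allowed in the key.
PLAIN_HTTP_SETTING = re.compile(r"^([\w.\-]+)\s*=\s*(http://\S+)$")


def find_plain_http_settings(conf_text):
    """Return (stanza, key, value) for each setting whose value uses http://."""
    hits = []
    stanza = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            stanza = line[1:-1]
            continue
        match = PLAIN_HTTP_SETTING.match(line)
        if match:
            hits.append((stanza, match.group(1), match.group(2)))
    return hits


# Hypothetical server.conf fragment, not taken from any poster's system.
example_conf = """
[shclustering]
disabled = 0
# mgmt_uri = http://commented-out:8089
mgmt_uri = http://mysystem2:8089
"""
hits = find_plain_http_settings(example_conf)
for stanza, key, value in hits:
    print(f"[{stanza}] {key} = {value}  (consider https://)")
```

The sketch only reports candidates; whether a given value should change still depends on how SSL is configured on that instance.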

Thanks.

fernandoandre
Communicator

In my specific case I wasn't able to discover the root cause of this issue, but I suspected it was related to earlier Splunk configurations made during testing, since I had no firewall enabled at the time. I "fixed" the problem by resetting the entire cluster, in other words, by clearing all existing configurations and starting from a clean state. These were the steps I followed:

  1. Remove the members from the cluster (this may fail with an error; ignore it and continue with the next steps) - splunk remove shcluster-member
  2. Stop Splunk on all SHC members and on the SH deployer - splunk stop
  3. Disable the SH cluster config - splunk disable shcluster-config
  4. Delete all SH cluster related settings from server.conf - $SPLUNK_HOME/etc/system/local/server.conf
  5. Clean the KV store - splunk clean kvstore (alternatively, clean everything with splunk clean all)
  6. Ensure the SH deployer and all SHC members have DNS entries and can query the DNS server
  7. Ensure firewall rules allow tcp/8089 and your chosen replication port, for example 8090
  8. Start the normal process to set up an SH cluster
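
Steps 6 and 7 can be checked from each machine before rebuilding the cluster. The following is a minimal Python sketch and not part of the procedure I actually ran. The helper names are hypothetical, and 8090 is just the example replication port from step 7. It also flags names that resolve to a loopback address, because the wget output in the question shows mysystem2 resolving to 127.0.0.1; whether that contributed to the problem was never established.

```python
import ipaddress
import socket


def resolve(host):
    """Return the sorted addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """True for 127.0.0.0/8 or ::1 (any IPv6 scope suffix is ignored)."""
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_member(host, ports=(8089, 8090)):
    """Summarize name resolution and port reachability for one member."""
    try:
        addresses = resolve(host)
    except socket.gaierror:
        addresses = []  # the name does not resolve at all
    return {
        "host": host,
        "addresses": addresses,
        "resolves_to_loopback": any(is_loopback(a) for a in addresses),
        "ports": {p: port_open(host, p) for p in ports} if addresses else {},
    }


# Example use, with the host names that appear in the log in the question:
# for member in ("mysystem2", "mysystem3", "mysystem4"):
#     print(check_member(member))
```

A successful TCP connect only shows that the port is reachable; it says nothing about whether the listener expects http or https.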

PS. I have the firewall enabled and the cluster works fine.

KVStore - https://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/ResyncKVstore
Remove members from cluster - https://docs.splunk.com/Documentation/Splunk/6.5.1/DistSearch/Removeaclustermember

sborys
Explorer

Any solutions to this? I'm getting the exact same thing. iptables is turned off. I am unable to assign the captain because of the error above. This is quite frustrating.

bohanlon_splunk
Splunk Employee

"Should have gotten at least 3 tokens in status line, while getting response code. Only got 0" could mean your Splunk instance is oversubscribed. That, or network problems.

As per:
https://answers.splunk.com/answers/217/error-httpclient-should-have-gotten-at-least-3-tokens-in-stat...

tprzelom
Path Finder

"service iptables stop"

The Linux firewall was my problem.

dnewburg
New Member

Was any resolution discovered for this issue?

jayannah
Builder

I'm facing this issue too. Please let us know if anyone has been able to find a fix.
