Deployment Architecture

"No route to host" on port 8089 across indexer cluster

mguhad
Communicator

My indexer cluster is down except for 1 out of 6 indexers. Port 8089 suddenly stopped working for indexer-to-indexer and CM<>indexer comms, and I get the error messages below. It's a multi-site indexer cluster. I have run telnet and curl against port 8089 on the indexers but still cannot connect to any of them except 1 of the 6. The deployment server is also inaccessible: the CM cannot reach the indexers on 8089, the indexers cannot talk to each other on 8089, and the DS cannot reach the indexers on 9996.

FYI, custom SSL is enabled on 8089, but I don't see it as the cause of this connectivity issue.
I have checked with the networking team, who say it's an application issue and not an iptables/routing issue on the servers, which is what I suspected. Please help.

IDX:

02-10-2020 03:19:20.324 +0000 WARN  CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=myCM:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=myidx mgmtport=8089 (reason: http client error=No route to host, while trying to reach https://myidx:8089/services/cluster/config). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=EF3B7708025567663732F8D6B146A83 add_type=Clear-Masks-And-ReAdd base_generation_id=2063 batch_serialno=1 batch_size=1 forwarderdata_rcv_port=9996 forwarderdata_use_ssl=1 last_complete_generation_id=0 latest_bundle_id=EF3B77080255676637732F8D6B146A83 mgmt_port=8089 name=EEC311D7-7778-44FA-B31D-E66672C1D568 register_forwarder_address= register_replication_address= register_search_address= replication_port=9100 replication_use_ssl=0 replications= server_name=myidx site=site3 splunk_version=7.2.6 splunkd_build_number=c0bf0f679ce9 status=Up } ].

CM:

02-07-2020 18:00:41.497 +0000 WARN  CMMaster - event=heartbeat guid=BDD6A029-2082-48ED-96F3-21BD624D94CD msg='signaling Clear-Masks-And-ReAdd' (unknown peer and master initialized=1
02-07-2020 18:00:41.911 +0000 WARN  TcpOutputFd - Connect to myidx:9996 failed. No route to host
02-07-2020 18:00:41.912 +0000 WARN  TcpOutputProc - Applying quarantine to ip=myidx port=9996 _numberOfFailures=2
02-07-2020 18:00:42.013 +0000 WARN  TcpOutputFd - Connect to myidx:9996 failed. No route to host
02-07-2020 18:00:42.323 +0000 WARN  CMMaster - event=heartbeat guid=44AF1666-AB56-4CC1-8F01-842AD327CF79 msg='signaling Clear-Masks-And-ReAdd' (unknown peer and master initialized=1
02-07-2020 10:36:54.650 +0000 WARN  CMRepJob - _rc=0 statusCode=502 transErr="No route to host" peerErr=""
02-07-2020 10:36:54.650 +0000 WARN  CMRepJob - _rc=0 statusCode=502 transErr="No route to host" peerErr=""

DS trying to connect to indexers:

02-07-2020 11:56:12.097 +0000 WARN  TcpOutputFd - Connect to idx2:9996 failed. No route to host
02-07-2020 11:56:12.098 +0000 WARN  TcpOutputFd - Connect to idx3:9996 failed. No route to host
02-07-2020 11:56:13.804 +0000 WARN  TcpOutputFd - Connect to idx1:9996 failed. No route to host

 


mguhad
Communicator

Just an update:

The issue was NOT a Splunk issue; it turned out to be firewalld, which was accidentally enabled by the Unix maintenance team after patching and blocked ALL ports on the affected servers. Because this happened per server, some indexers kept working while others didn't.

After firewalld was disabled on all the indexers and the CM, everything worked again: all indexers were visible on the CM's dashboard, the deployment server was reachable from the GUI and able to send its internal logs to the indexers, and port 8089 on the DS worked as expected.

Solution: turn off firewalld on each server.
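For reference, a rough sketch of that fix on CentOS/RHEL-style hosts. The port list is an assumption based on this thread (8089 management, 9996 forwarder data, 9100 replication); run on every indexer and the CM:

```shell
# Ports seen in this thread: 8089 (mgmt), 9996 (forwarder data), 9100 (replication)
splunk_ports="8089/tcp 9996/tcp 9100/tcp"

# Option 1: keep firewalld running but open just the Splunk ports
# (--permanent plus --reload makes the change survive restarts)
allow_splunk_ports() {
  for p in $splunk_ports; do
    sudo firewall-cmd --permanent --add-port="$p"
  done
  sudo firewall-cmd --reload
}

# Option 2 (what resolved this thread): turn firewalld off entirely
# sudo systemctl stop firewalld
# sudo systemctl disable firewalld
```

Opening only the required ports is the narrower fix if your security policy does not allow disabling the host firewall outright.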

Thanks again @nickhillscpl for your input, which led me to finding the issue.


anujarosha
Explorer

The following two commands, run on the Splunk server, helped me figure out the issue and find a solution.

Check whether anything is listening on the port:

lsof -i:8088

Open the port (add --permanent and run firewall-cmd --reload if you want it to survive restarts):

firewall-cmd --add-port=8088/tcp

My Splunk server runs on CentOS 7.



nickhills
Ultra Champion

All those messages are telling you that you have connectivity errors.

Regardless of what the network team says, if Splunk is running on all your instances, you have a networking issue.
Check DNS, layer 3, layer 2, and firewalls.
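Those checks can be scripted. A minimal sketch using bash's built-in /dev/tcp (the hostnames are placeholders taken from this thread; `timeout` is from coreutils):

```shell
#!/usr/bin/env bash
# Report whether each Splunk port is reachable on each peer.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port UNREACHABLE"   # matches the 'No route to host' symptom
  fi
}

for host in idx1 idx2 idx3 myCM; do   # placeholder hostnames from this thread
  check_port "$host" 8089             # management port
  check_port "$host" 9996             # forwarder data port
done
```

Running this from the CM and from each indexer shows at a glance which direction of traffic is blocked.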

If my comment helps, please give it a thumbs up!

mguhad
Communicator

Well understood. The networking team have dismissed it as an application-specific issue rather than a firewall one, which I disagree with for the reasons you mention above.
But what I want to clarify is this: those errors are NOT being caused by Splunk (or any of its configurations) but are instead caused by the routing/networking configuration on the servers/OS, right?


nickhills
Ultra Champion

Correct. "No route to host" means that the host is unreachable from the instance generating the message.

Can you ping the indexers from one another? (This will also depend on FW rules etc., but it's a good thing to test.)

Running tcpdump on 8089 (or your replication ports) is another good test; at a guess, you will see indexers opening connections but not receiving anything back.
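For example (a sketch; tcpdump needs root, and `-i any` is a Linux-specific assumption):

```shell
# Watch the management port on a 'bad' indexer while the CM retries.
# Inbound SYNs with no reply would point at a drop below Splunk
# (host firewall or network), since splunkd never sees the packets.
sudo tcpdump -nn -i any 'tcp port 8089'
```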

If my comment helps, please give it a thumbs up!

mguhad
Communicator

Yes, but splunkd remains running on all of them.

A plain ping (no port specified) works. However, telnet between the indexers on port 8089 fails:

Trying myidx
no route to host
Failed to connect to myidx:8089; No route to host
closing connection 0
curl:(7) Failed connect to myidx:8089;No route to host

That's the case for all indexer<>indexer comms except the one indexer that, for some reason, still connects on 8089.
The same applies to the DS trying to reach the indexers on 9996: the same error as above.

From what I can see and the troubleshooting I have done, it all seems beyond Splunk and OS/network related, or am I missing something?


nickhills
Ultra Champion

Try this locally on one of the 'bad' indexers:
curl -k https://localhost:8089
If you get an XML response locally, you have proved it's a comms issue.

If my comment helps, please give it a thumbs up!

mguhad
Communicator

Hmm, what does that command actually prove/verify?
And yes, great: I am able to get an XML response. So does this confirm the issue is indeed not Splunk as an application and is instead an OS/networking-layer/routing issue on the server?


nickhills
Ultra Champion

That command connects from the server you run it on to itself (let's call that server "serverA.yourco.com").
If you get an XML response, it means the application is responding to itself.

If you run the command remotely, e.g. from "serverB":
curl -k https://serverA.yourco.com:8089
and you do not get the same response, then it proves that "serverB" cannot talk to "serverA" on port 8089.

If my comment helps, please give it a thumbs up!

mguhad
Communicator

Oh OK, got it, thanks.
So it gets odd from here: I am able to connect (get an XML response) to all but 2 of the indexers, and I am also able to connect to the CM with that command!

So if I can connect to the CM, why isn't this indexer actually connecting to the CM on 8089 and showing up on the CM's dashboard?

This has confused me even more.


nickhills
Ultra Champion

Are you sure something hasn't "fixed itself" in the same way it "broke itself"? 🙂

If your cluster is unhealthy, I would restart the CM and, once it is back up, put it in maintenance mode.

Wait until all the peers are back online, then take it out of maintenance mode and let it run its fixup.
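On the cluster master's CLI, that sequence might look like the sketch below ($SPLUNK_HOME and admin credentials are assumptions; it is wrapped in a function here so nothing runs until you call it):

```shell
# Sketch of the restart + maintenance-mode sequence described above;
# run on the cluster master only.
cm_recover() {
  "$SPLUNK_HOME/bin/splunk" restart
  "$SPLUNK_HOME/bin/splunk" enable maintenance-mode --answer-yes
  "$SPLUNK_HOME/bin/splunk" show maintenance-mode      # confirm it is on
  # ...wait until every peer shows Up on the CM dashboard...
  "$SPLUNK_HOME/bin/splunk" disable maintenance-mode   # cluster then runs its fixup
}
```

Maintenance mode suppresses bucket fixup while the peers rejoin, which avoids a storm of replication work during the restart.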

If the data in this cluster is business critical, it may well be worth opening a ticket with support to get proper assistance with the rebuild if you are not comfortable with the process.

If my comment helps, please give it a thumbs up!

mguhad
Communicator

No, nothing fixed itself; only 1 of the 6 indexers is showing on my CM dashboard, so the cluster is stuck in an unhealthy loop because it can't meet RF/SF.

BUT... when I ran the above command from the CM to the indexers (rather than idx<>idx), I could not establish a connection at all, except to the one working indexer. So does this mean the CM server is at fault when it originates the connection?


nickhills
Ultra Champion

From the CM, if you ping each of the indexers' DNS names, does each resolve to the correct IP address?

And vice versa: if you ping the CM from each IDX, does it resolve to the correct IP for the CM?

If my comment helps, please give it a thumbs up!

mguhad
Communicator

We are using IP addresses rather than DNS, so that would not apply. When I ping the indexers from the CM, ping itself works fine; it's only the specific port (8089) that fails.

But when the indexers are the origin, they can connect to the CM fine, just not the other way round. Might that explain the "CMRepJob" errors on the CM, since it can't connect to the indexers?

I'm simply trying to rule out whether it is a Splunk issue or an OS/server routing issue.

Very odd behaviour. I have now enabled maintenance mode on the CM. I will chase this up with the network team and raise a ticket if all else fails.


codebuilder
Influencer

What has changed in your cluster since it was last working? Reboots? OS upgrade? Patching? Anything else?

----
An upvote would be appreciated and Accept Solution if it helps!

codebuilder
Influencer

In your Splunk configs, how are you referring to these servers: by IP or by hostname? If the latter, have you tried an nslookup on them? If DNS fails, the traffic will obviously fail too.

Also, I would ask your network team whether they are certain the ACL is bi-directional.

----
An upvote would be appreciated and Accept Solution if it helps!

mguhad
Communicator

We are using IP addresses, so no DNS at all. I haven't yet asked what ACL mode is being used, though.
