My indexer cluster is down except for 1 out of 6 indexers. Port 8089 has suddenly stopped working for indexer and CM<>indexer comms, and I get the error messages below. It's a multi-site indexer cluster. I have run telnet and curl commands against 8089 on the indexers but am still unable to connect to all but 1 of the 6. The deployment server is also not accessible. The CM is unable to connect to the indexers on 8089, the indexers cannot talk to each other on port 8089 either, and the DS is not able to connect to my indexers on 9996.
FYI, custom SSL is enabled on 8089, but I don't see it as the cause of this connectivity issue.
I have checked with the networking team, who say it's an application issue and not an iptables/routing issue on the server as I suspected. Please help.
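For reference, the checks I ran were along these lines (myidx is just a placeholder for each indexer's hostname/IP):
# from the CM: can we reach the indexer's management port?
telnet myidx 8089
curl -k https://myidx:8089
# from the DS: can we reach the indexer's receiving port?
telnet myidx 9996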
IDX:
02-10-2020 03:19:20.324 +0000 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=myCM:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=myidx mgmtport=8089 (reason: http client error=No route to host, while trying to reach https://myidx:8089/services/cluster/config). [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=EF3B7708025567663732F8D6B146A83 add_type=Clear-Masks-And-ReAdd base_generation_id=2063 batch_serialno=1 batch_size=1 forwarderdata_rcv_port=9996 forwarderdata_use_ssl=1 last_complete_generation_id=0 latest_bundle_id=EF3B77080255676637732F8D6B146A83 mgmt_port=8089 name=EEC311D7-7778-44FA-B31D-E66672C1D568 register_forwarder_address= register_replication_address= register_search_address= replication_port=9100 replication_use_ssl=0 replications= server_name=myidx site=site3 splunk_version=7.2.6 splunkd_build_number=c0bf0f679ce9 status=Up } ].
CM:
02-07-2020 18:00:41.497 +0000 WARN CMMaster - event=heartbeat guid=BDD6A029-2082-48ED-96F3-21BD624D94CD msg='signaling Clear-Masks-And-ReAdd' (unknown peer and master initialized=1
02-07-2020 18:00:41.911 +0000 WARN TcpOutputFd - Connect to myidx:9996 failed. No route to host
02-07-2020 18:00:41.912 +0000 WARN TcpOutputProc - Applying quarantine to ip=myidx port=9996 _numberOfFailures=2
02-07-2020 18:00:42.013 +0000 WARN TcpOutputFd - Connect to myidx:9996 failed. No route to host
02-07-2020 18:00:42.323 +0000 WARN CMMaster - event=heartbeat guid=44AF1666-AB56-4CC1-8F01-842AD327CF79 msg='signaling Clear-Masks-And-ReAdd' (unknown peer and master initialized=1
02-07-2020 10:36:54.650 +0000 WARN CMRepJob - _rc=0 statusCode=502 transErr="No route to host" peerErr=""
02-07-2020 10:36:54.650 +0000 WARN CMRepJob - _rc=0 statusCode=502 transErr="No route to host" peerErr=""
DS trying to connect to indexers:
02-07-2020 11:56:12.097 +0000 WARN TcpOutputFd - Connect to idx2:9996 failed. No route to host
02-07-2020 11:56:12.098 +0000 WARN TcpOutputFd - Connect to idx3:9996 failed. No route to host
02-07-2020 11:56:13.804 +0000 WARN TcpOutputFd - Connect to idx1:9996 failed. No route to host
Just an update:
The issue was NOT a Splunk issue; it turned out to be firewalld, which had been accidentally enabled by the Unix maintenance team after patching and blocked ALL ports on the server. It affected servers individually, which is why some indexers worked and some didn't.
After firewalld was disabled on all the indexers and the CM, everything worked fine again and all indexers were visible on the CM's dashboard. The deployment server was reachable from the GUI again, it was able to send its internal logs to the indexers, and 8089 on the DS began working as expected.
Solution: turn off firewalld on each server.
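For anyone hitting the same thing, this is roughly what was run on each affected indexer/CM (assuming a systemd-based OS such as CentOS/RHEL 7):
# check whether firewalld is running and what it allows
systemctl status firewalld
firewall-cmd --list-all
# stop it and keep it from coming back after the next reboot
systemctl stop firewalld
systemctl disable firewalld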
Thanks again @nickhillscpl for your input, which led me to find the issue.
The following two commands, run on the Splunk server, helped me figure out the issue and arrive at a solution.
Check whether anything is listening on the port:
lsof -i:8088
Open the port:
firewall-cmd --add-port=8088/tcp
My Splunk server runs on CentOS 7
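One caveat, if I remember correctly (assuming default firewalld behaviour): --add-port on its own only changes the runtime configuration, so to keep the rule after a firewalld reload or a reboot you would also want something like:
# persist the rule, then reload so it takes effect
firewall-cmd --permanent --add-port=8088/tcp
firewall-cmd --reload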
All those messages are telling you that you have connectivity errors.
Regardless of what the network team says, if Splunk is running on all your instances, you have a networking issue.
Check DNS, Layer 3, Layer 2, and firewalls. A few quick checks are sketched below.
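For example, from the CM towards each indexer (idx1 is just a placeholder; adjust for your hostnames/IPs):
# basic reachability
ping -c 3 idx1
# can we actually open the management port?
curl -kv https://idx1:8089
# and on each indexer itself: is a host firewall filtering traffic?
firewall-cmd --list-all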
Well understood. The networking team have dismissed it as an application-specific issue rather than a firewall issue, which I disagree with for the reasons you mention above.
But what I want to clarify is: those issues are NOT being caused by Splunk (or any of its configurations) but are instead being caused by the routing/networking configuration on the servers/OS, right?
Correct " No route to host" means that the host is unreachable from the instance generating it.
Can you ping the indexers from one another (also will depend on FW rules etc), but its a good thing to test.
TCPDumping on 8089 (or your replication ports) is another good test - at a guess you will see indexers opening connections but not recieving anything back
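Something like this on the indexer that should be receiving the connection (interface and port will vary with your setup):
# watch the management port for incoming connection attempts
tcpdump -nn -i any port 8089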
Yes, but splunkd remains running.
A plain ping (no port specified) works. However, running telnet between the indexers on port 8089 fails:
Trying myidx
no route to host
Failed to connect to myidx:8089; No route to host
closing connection 0
curl: (7) Failed to connect to myidx:8089; No route to host
That's the case for all indexer<>indexer comms except the one indexer that is working, which for some reason still connects on 8089.
This is also the case for the DS trying to connect to the indexers on 9996: the same error as above.
From what I can see and the troubleshooting I have done, it all seems to be beyond Splunk and OS/network related, or am I missing something?
Try locally on one of the 'bad' indexers:
curl -k https://localhost:8089
If you get an XML response locally, you have proved it's a comms issue.
Hmm. What does that command prove/verify?
And yes, great: I am able to get an XML response. So does this confirm the issue is indeed not Splunk as an application and is more to do with an OS/networking-layer/routing issue on the server?
That command connects from the server you run it on to itself (let's call that server "serverA.yourco.com").
If you get an XML response, it means the application is responding to itself.
If you run the command remotely, e.g. from "serverB":
curl -k https://serverA.yourco.com:8089
and you do not get the same response, then it proves that "serverB" cannot talk to "serverA" on port 8089.
Oh OK, got it! Thanks.
So it gets odd from here. I am able to connect (get an XML response) to all but 2 of the indexers. I am also able to connect to the CM with that command!
So if I can connect to the CM, why isn't this indexer actually connecting to the CM on 8089 and showing up on the CM's dashboard?
This has confused me more now.
Are you sure something hasn't "fixed itself" in the same way it "broke itself"? 🙂
If your cluster is unhealthy, I would restart the CM and, once it has restarted, put it in maintenance mode.
Wait until all the peers are back online, then take it out of maintenance mode and let it run its fixup.
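If it helps, the CLI for that on the cluster master is along these lines (authentication flags omitted; splunk assumed to be on the PATH):
# after the CM has restarted
splunk enable maintenance-mode
splunk show maintenance-mode
# once all peers are back online
splunk disable maintenance-mode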
If the data in this cluster is business critical, it may well be worth opening a ticket with support to get some proper assistance with the rebuild if you are not comfortable with the process.
No, nothing fixed itself; only 1 of 6 indexers is showing on my CM dashboard, so the cluster is stuck in an unhealthy loop as it can't meet RF/SF.
BUT... when I ran the above command from the CM to the indexers, rather than idx<>idx, I was not able to establish a connection at all, except to the one working indexer. So does this mean the CM server is at fault here when it is the originating side?
From the CM, if you ping each of the DNS names for the indexers, does it resolve to the correct IP address?
And vice versa: if you ping the CM from each IDX, does it resolve to the correct IP for the CM?
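For example (idx1 being whatever name your configs actually use):
# does the name resolve, and to the IP you expect?
nslookup idx1
# or, to see what the local resolver (including /etc/hosts) returns
getent hosts idx1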
We are using IP addresses rather than DNS, so that would not apply. But when I test the indexers from the CM, there is no connection on 8089.
Ping works fine; it's only the port-specific tests that fail.
But when the indexers are the origin, they can connect to the CM fine, just not the other way round. Might that explain the "CMRepJob" errors on the CM, as it can't connect to the indexers?
I'm simply trying to rule out whether it is a Splunk issue or an OS/server routing issue.
Very odd behaviour. I have now enabled maintenance mode on the CM. I will chase this up with the networks team and raise a ticket if all else fails.
What has changed in your cluster since it was working previously? Reboots? OS Upgrade? Patch? Other?
In your Splunk instance, how are you referring to these servers in the configs? IP or hostname? If the latter, have you tried an nslookup on them? If DNS fails, obviously the traffic will fail.
Also, I would ask your network team if they are certain the ACL is bi-directional.
We are using IP addresses, so no DNS at all. Although I haven't asked about what ACL mode is being used.