I am trying to use register_replication_address to tell a cluster master running in front of an ELB to contact the indexer on a different address. However, when I try to add the peer I get the following error on the CM:
REST_Calls - app=search POST cluster/master/peers/ id=526D8BF5-7412-4934-AC47-08C699290CC9: active_bundle_id -> [14310A4AABD23E85BBD4559C4A3B59F8], add_type -> [Initial-Add], base_generation_id -> [0], batch_serialno -> [1], batch_size -> [2], buckets -> [], forwarderdata_rcv_port -> [9997], forwarderdata_use_ssl -> [0], indexes -> [], last_complete_generation_id -> [0], latest_bundle_id -> [14310A4AABD23E85BBD4559C4A3B59F8], mgmt_port -> [8089], register_forwarder_address -> [], register_replication_address -> [https://10.0.7.181:8089], register_search_address -> [], replication_port -> [9887], replication_use_ssl -> [0], replications -> [], server_name -> [ip-10-0-7-181.ca-central-1.compute.internal], site -> [default], splunk_version -> [8.0.2], splunkd_build_number -> [a7f645ddaf91], status -> [Up]
INFO AdminManager - Setting capability.write=edit_indexer_cluster for handler clustermasterpeers.
INFO AdminManager - Setting capability.read=edit_indexer_cluster for handler clustermasterpeers.
DEBUG AdminManager - Validating argument values...
DEBUG AdminManagerValidation - Validating rule='validate(len(name) < 1024, 'Parameter "name" must be less than 1024 characters.')' for arg='name'.
ERROR ClusterMasterPeerHandler - Invalid host name https://10.0.7.181:8089
DEBUG AdminManager - URI /services/cluster/master/peers/?output_mode=json generated an AdminManagerExceptionBase exception in handler 'clustermasterpeers': Invalid host name https://10.0.7.181:8089
INFO CMSlave - event=addPeer status=failure shutdown=false request: AddPeerRequest: { _id= _indexVec=''active_bundle_id=14310A4AABD23E85BBD4559C4A3B59F8 add_type=Initial-Add base_generation_id=0 batch_serialno=1 batch_size=2 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=14310A4AABD23E85BBD4559C4A3B59F8 mgmt_port=8089 name=526D8BF5-7412-4934-AC47-08C699290CC9 register_forwarder_address= register_replication_address=https://10.0.7.181:8089 register_search_address= replication_port=9887 replication_use_ssl=0 replications= server_name=ip-10-0-7-181.ca-central-1.compute.internal site=default splunk_version=8.0.2 splunkd_build_number=a7f645ddaf91 status=Up }
04-23-2020 02:03:56.478 +0000 ERROR CMSlave - event=addPeer start over and retry after sleep 12800ms reason addType=Initial Add Batch SN=1/2 failed. add_peer_network_ms=5
Notice how it says something about the name having to be less than 1024 characters. Could it be failing that validation?
The Cluster Master can "resolve" the IP, although since it's already an IP I see no reason why it would need resolving. The "can't resolve '(null)'" message is odd, though. I added a hosts file entry; no difference:
`
nslookup: can't resolve '(null)'
Name: 10.0.7.181
Address 1: 10.0.7.181 ip-10-0-7-181.ca-central-1.compute.internal
Ncat: Version 7.70 ( https://nmap.org/ncat )
The Clustermaster can reach the Indexer on that port:
Ncat: Connected to 10.0.7.181:8089.
`
Any reason why this happens?
I've read a few posts, and register_replication_address seems to be the solution to my problem; however, I am unsure why it is "unable to resolve hostname".
*** UPDATE ***
I also want to add that I've been doing more testing on nodes that are just two EC2 instances with all traffic allowed between them. nslookup on AWS for the IPs is fine, and I still cannot get this working. If I remove register_replication_address in these cases, it works fine. This is really weird. I'm not sure what the issue is or how to troubleshoot further if the log just says "Invalid host name".
I think your problem is the value of the "register_replication_address" parameter, which appears as "https://10.0.7.181:8089" in your error logs. This parameter must be an IP address or a fully qualified machine/domain name, not a URL.
You can test with the setting below in server.conf:
register_replication_address = 10.0.7.181
If you are using indexer discovery you will need to set "register_forwarder_address" too.
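For anyone who wants to sanity-check a value before putting it in server.conf, here is a minimal sketch of that rule. `looks_like_bare_host` is a hypothetical helper, not part of Splunk, and it only approximates the rule stated above (bare IP or hostname, no scheme, port, or path); the check Splunk actually performs may differ.

```python
import ipaddress
import re

# Hypothetical helper, not a Splunk API. It approximates the rule
# "IP address or fully qualified name, not a URL".
_HOST_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_bare_host(value: str) -> bool:
    """Return True if value is a bare IP address or hostname."""
    value = value.strip()
    # Reject anything carrying a URL scheme or a path.
    if not value or "://" in value or "/" in value:
        return False
    # Accept valid IPv4/IPv6 literals as-is.
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        pass
    # A remaining colon means host:port (or a malformed IPv6 address).
    if ":" in value:
        return False
    # Otherwise require dot-separated hostname labels.
    labels = value.rstrip(".").split(".")
    return all(_HOST_LABEL.match(label) for label in labels)
```

With the values from this thread, `looks_like_bare_host("https://10.0.7.181:8089")` is rejected, while `looks_like_bare_host("10.0.7.181")` and `looks_like_bare_host("ip-10-0-7-181.ca-central-1.compute.internal")` pass.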
OMG, that is exactly the problem!!
I can't believe it's so dumb, lol. The second question, I guess, would be: how can we change the mgmt port from 8089? For example, if I want to use only 443 for communication with the indexer, is that possible?
I will accept this as the answer btw, this is definitely the solution.
The CM uses "register_replication_address" to communicate with the indexers. When an indexer attempts to join, the CM will ping the indexer's IP, or the "register_replication_address" value if that is set. This checks that two-way communication between the CM and the indexers works before allowing an indexer in.
If it's not reachable, the indexer is not allowed in.
I added security groups to allow all traffic (the CM wasn't allowed to ping the indexer before), but I still get the same error.
Also, by two-way communication, do you mean ICMP, or communication on the specified port for the register replication address?
Replication port has nothing to do with this right? That's the port the indexers will use to communicate with each other for replication?
Whatever you put in "register_replication_address" is exactly how the entire cluster will communicate with that indexer. Make sure it's valid: see if you can telnet to that address and port.
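If telnet or ncat is not available on the box, the same check can be sketched in Python. This is only a plain TCP connect test, assuming Python 3 is installed; `can_connect` is a hypothetical helper name, and it says nothing about how the CM itself validates the peer.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Address and ports taken from the logs in this thread:
    # 8089 is the mgmt port, 9887 the replication port.
    for port in (8089, 9887):
        print(port, can_connect("10.0.7.181", port))
```

Run it from the CM and from another indexer, since both need to reach the registered address.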
The CM can reach the indexer on that port and even list the APIs when I run curl. When I run tcpdump, I'm not even seeing connections from the cluster master.
`curl https://10.0.7.181:8089 -k
https://10.0.7.181:8089/
2020-04-23T22:56:35+00:00
<name>Splunk</name>
<title>rpc</title>
<id>https://10.0.7.181:8089/rpc</id>
<updated>1970-01-01T00:00:00+00:00</updated>
<link href="/rpc" rel="alternate"/>
`