dns_cluster

DNS discovery fails when RELEASE_NODE is FQDN

Open j4nk3e opened this issue 1 year ago • 7 comments

I tried getting dns_cluster working with FQDNs, but discovery always fails when I set an FQDN in the RELEASE_NODE env variable. The error message, which comes from Erlang, is: Cannot get connection id for node app_name@prod-01.<redacted>.com. When I call Node.connect :"app_name@prod-01.<redacted>.com" manually, it connects and everything works fine. I have an A record for <redacted>.com with the IP addresses of prod-01 and prod-02, the two machines that should connect to each other, and they have full TCP access to each other. The hostname of each machine is set to its own FQDN, and I created an A record for each of those, too.

The documentation clearly states:

export RELEASE_NODE="myapp@fully-qualified-host-or-ip"

so I assume an FQDN as node name is supported by dns_cluster.

When I replace the FQDN in RELEASE_NODE with the public IP address, DNS discovery starts working. However, I would prefer to use domain names in the config to make the setup more robust in case a machine gets a new IP address.

Am I doing something wrong, or might this be an issue somewhere in the dns_cluster lib? Unfortunately I can't figure out where the Cannot get connection id for node error is coming from.

j4nk3e Jan 24 '25 15:01

Can you resolve the DNS if you run Elixir on that node and simulate the code in this library? The DNS queries it runs are small, so you can try reproducing it.

josevalim Jan 24 '25 15:01

Yes, I tried running :inet_res.getbyname("<redacted>.com" |> String.to_charlist, :a) and got back the IPs of both machines, as configured in the DNS entry for my cluster. Edit: I'll try running everything line by line again; I couldn't reproduce the error yet.

j4nk3e Jan 24 '25 16:01

It seems that Node.connect works fine with the hostname, but fails when trying to connect using an IP address instead. dns_cluster always tries to connect using the IPs it got from the resolver.lookup call. If I understand the problem correctly, we either have to find out the hostname of the other machine (we only have the IP returned by the DNS), or convince Erlang to allow connecting by IP to a node whose RELEASE_NODE is set to a hostname.

j4nk3e Jan 24 '25 17:01

The node and the name have to match. So if you name your machine --name foo@hostname, then you need to connect using foo@hostname. If you name it using an IP, then you connect using the IP.
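To illustrate the mismatch, here is a minimal sketch. It assumes that discovery builds the target node name from the local basename plus each resolved IP, as described earlier in this thread. NodeNameSketch, its functions, the example.com hostname, and the 203.0.113.10 address are all hypothetical placeholders and are not part of dns_cluster.

```elixir
defmodule NodeNameSketch do
  # Hypothetical helper for illustration only; not part of dns_cluster.

  # Returns the part of a node name before the "@".
  def basename(node) when is_atom(node) do
    [base, _host] = node |> Atom.to_string() |> String.split("@", parts: 2)
    base
  end

  # Builds the node name that would be tried for a given host or IP string.
  def candidate(node, host) when is_binary(host) do
    String.to_atom(basename(node) <> "@" <> host)
  end
end

# The name the remote node was started with (placeholder FQDN).
remote = :"app_name@prod-01.example.com"

# Built from a resolved IP: a different atom than the remote's real name.
by_ip = NodeNameSketch.candidate(remote, "203.0.113.10")

# Built from the hostname: identical to the remote's real name.
by_host = NodeNameSketch.candidate(remote, "prod-01.example.com")

IO.inspect(by_ip == remote)   # false
IO.inspect(by_host == remote) # true
```

Only the second name equals the one the remote node registered under, which is consistent with Node.connect succeeding with the hostname and failing with the IP.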

josevalim Feb 18 '25 18:02

We looked into using this library for our applications, but we have a similar problem 🤔

Scenario

Our nodes have the hostname instead of the IP address in their RELEASE_NODE:

# e.g. [email protected]
export RELEASE_NODE=<%= @release.name %>@$(hostname -f)

Nodes in the cluster:

When connecting to the remote shell, it's very clear which node you're connected to, because you see the node name including the specific host (e.g. app-1) and the environment (e.g. dev). Seeing only the IP address does not tell you whether it's a prod or non-prod node.

Currently we look up an SRV record and use the hostnames instead of the IPs from the result:

$ dig SRV app.dev.acme.cloud
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> app.dev.acme.cloud
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1957
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;app.dev.acme.cloud. IN A

;; ANSWER SECTION:
app.dev.acme.cloud. 4 IN SRV 0 16 4000 app-1.dev.acme.cloud.
app.dev.acme.cloud. 4 IN SRV 0 16 4000 app-2.dev.acme.cloud.
...

;; ADDITIONAL SECTION:
app-1.dev.acme.cloud. 4 IN A 192.168.100.1
app-2.dev.acme.cloud. 4 IN A 192.168.100.2
...

;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Oct 14 09:26:01 UTC 2025
;; MSG SIZE  rcvd: 181

We're then doing Node.connect(:"[email protected]"). I guess that's mostly what Cluster.Strategy.Kubernetes.DNSSRV does.
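A rough sketch of that flow, assuming Erlang's :inet_res.lookup/3 with the :srv type, which returns {priority, weight, port, target} tuples. SrvNodesSketch, its function names, and the "app" basename are hypothetical placeholders; this is neither our production code nor code from dns_cluster or libcluster.

```elixir
defmodule SrvNodesSketch do
  # Hypothetical helper for illustration only.

  # Queries SRV records for the given name; returns [] if the lookup fails.
  def lookup_srv(query) when is_binary(query) do
    :inet_res.lookup(String.to_charlist(query), :in, :srv)
  end

  # Turns SRV tuples into node names, keeping the target hostname
  # instead of resolving it to an IP address.
  def to_nodes(srv_records, basename) when is_binary(basename) do
    for {_priority, _weight, _port, target} <- srv_records do
      host = target |> to_string() |> String.trim_trailing(".")
      :"#{basename}@#{host}"
    end
  end
end

# Records shaped like the dig answer above (targets are charlists).
records = [
  {0, 16, 4000, ~c"app-1.dev.acme.cloud"},
  {0, 16, 4000, ~c"app-2.dev.acme.cloud"}
]

nodes = SrvNodesSketch.to_nodes(records, "app")

# Each entry could then be passed to Node.connect/1.
IO.inspect(nodes)
```

In a running system, the records would come from SrvNodesSketch.lookup_srv("app.dev.acme.cloud") rather than from the hard-coded list.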

The dns_cluster library currently translates the DNS lookup result into a list of IP addresses. I guess that's fine in most setups, but it doesn't work for us 🤔

Ideas

I thought of providing a PR to optionally skip the call to lookup_hosts/1, but both ways of doing it felt wrong 😕

  • Passing another parameter to the resolver's lookup function
  • Adding another "virtual" resource type (e.g. :srv_host), but how would we handle cases where :srv_host and another resource type are both given?

Both ideas could cause breaking changes...

Maybe it makes more sense to simply use libcluster instead, even though we're not using Kubernetes 🤔

@josevalim WDYT?

janpieper Oct 14 '25 10:10

My suggestion is to introduce an option called :address_type, defaulting to :ip, that also supports a :hostname value. With :hostname, we don't look up IPs for either A/AAAA or SRV records.
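A minimal sketch of how such an option could behave, under the assumption that discovery starts from a list of hostnames and normally resolves each to IPs. AddressTypeSketch, hosts/3, and the stand-in resolver are hypothetical names for illustration; this is not the code in dns_cluster or in any PR.

```elixir
defmodule AddressTypeSketch do
  # Hypothetical sketch of the proposed :address_type option.

  # names:   hostnames taken from the DNS answer (strings)
  # opts:    keyword list; :address_type defaults to :ip
  # resolve: function mapping one hostname to a list of IP strings
  def hosts(names, opts, resolve) when is_function(resolve, 1) do
    case Keyword.get(opts, :address_type, :ip) do
      # Default behaviour: resolve every name to its IP addresses.
      :ip -> Enum.flat_map(names, resolve)
      # Proposed behaviour: skip the IP lookup and keep the hostnames.
      :hostname -> names
    end
  end
end

# Stand-in resolver so the example runs without DNS access.
fake_resolve = fn
  "app-1.dev.acme.cloud" -> ["192.168.100.1"]
  "app-2.dev.acme.cloud" -> ["192.168.100.2"]
  _other -> []
end

names = ["app-1.dev.acme.cloud", "app-2.dev.acme.cloud"]

IO.inspect(AddressTypeSketch.hosts(names, [], fake_resolve))
IO.inspect(AddressTypeSketch.hosts(names, [address_type: :hostname], fake_resolve))
```

Keeping :ip as the default means existing configurations would behave as before, while :hostname yields names that match a RELEASE_NODE set to an FQDN.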

josevalim Oct 14 '25 11:10

@josevalim I've created #20. Maybe you can take a first look before I add tests, update the documentation, etc.

janpieper Oct 15 '25 09:10