Controller does not properly add node metadata
TL;DR
I set up a talos.dev cluster on hcloud and expect HCCM to populate the node objects with metadata so that I can order load balancers.
Expected behavior
The node objects are populated with metadata, the load balancers are created, and there are no errors in the HCCM logs.
Observed behavior
I set up the cluster according to the instructions here: https://www.talos.dev/v1.6/talos-guides/install/cloud-platforms/hetzner/
I introduced several changes. First, I created the virtual machines with a private network attached. Then I prepared a Talos patch file that looks like this:
$ cat patch.yaml
cluster:
  network:
    cni:
      name: none
    podSubnets:
      - 100.64.0.0/16
    serviceSubnets:
      - 100.96.0.0/16
  proxy:
    disabled: true
  etcd:
    advertisedSubnets:
      - 10.0.0.0/8
machine:
  kubelet:
    extraArgs:
      cloud-provider: external
    nodeIP:
      validSubnets:
        - 10.0.0.0/8
and applied it when creating the cluster. The idea was to use the private subnet for joining cluster nodes and to avoid using public addresses for cluster connectivity.
Minimal working example
No response
Log output
nodes
kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-control-plane-1 Ready control-plane 52m v1.29.1 10.0.0.2 <none> Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-control-plane-2 Ready control-plane 4d4h v1.29.1 10.0.0.3 <none> Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-control-plane-3 Ready control-plane 4d4h v1.29.1 10.0.0.4 <none> Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-worker-1 Ready <none> 51m v1.29.1 10.0.0.5 <none> Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-worker-2 Ready <none> 50m v1.29.1 10.0.0.6 <none> Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
The logs of the CCM:
I0215 21:44:34.638664 1 node_controller.go:431] Initializing node talos-worker-2 with cloud provider
--- Request:
GET /v1/servers?name=talos-worker-2 HTTP/1.1
Host: api.hetzner.cloud
User-Agent: hcloud-cloud-controller/v1.19.0 hcloud-go/2.4.0
Authorization: REDACTED
Accept-Encoding: gzip
--- Response:
HTTP/2.0 200 OK
Content-Length: 5787
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
Access-Control-Allow-Methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Max-Age: 1728000
Content-Type: application/json
Date: Thu, 15 Feb 2024 21:44:34 GMT
Link: <https://api.hetzner.cloud/v1/servers?name=talos-worker-2&page=1>; rel=last
Ratelimit-Limit: 3600
Ratelimit-Remaining: 3569
Ratelimit-Reset: 1708033505
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Correlation-Id: 8eff8b5832344a5e
{
"servers": [
{
"id": 43268742,
"name": "talos-worker-2",
"status": "running",
"created": "2024-02-11T17:21:28+00:00",
"public_net": {
"ipv4": {
"ip": "65.108.90.3",
"blocked": false,
"dns_ptr": "static.3.90.108.65.clients.your-server.de",
"id": 51533159
},
"ipv6": {
"ip": "2a01:4f9:c011:bd8c::/64",
"blocked": false,
"dns_ptr": [],
"id": 51533160
},
"floating_ips": [],
"firewalls": []
},
"private_net": [
{
"network": 3866040,
"ip": "10.0.0.6",
"alias_ips": [],
"mac_address": "86:00:00:77:de:83"
}
],
"server_type": {
"id": 98,
"name": "ccx33",
"description": "CCX33 Dedicated CPU",
"cores": 8,
"memory": 32.0,
"disk": 240,
"deprecated": false,
"prices": [
{
"location": "fsn1",
"price_hourly": {
"net": "0.0769000000",
"gross": "0.0769000000000000"
},
"price_monthly": {
"net": "47.9900000000",
"gross": "47.9900000000000000"
}
},
{
"location": "nbg1",
"price_hourly": {
"net": "0.0769000000",
"gross": "0.0769000000000000"
},
"price_monthly": {
"net": "47.9900000000",
"gross": "47.9900000000000000"
}
},
{
"location": "hel1",
"price_hourly": {
"net": "0.0769000000",
"gross": "0.0769000000000000"
},
"price_monthly": {
"net": "47.9900000000",
"gross": "47.9900000000000000"
}
},
{
"location": "ash",
"price_hourly": {
"net": "0.0769000000",
"gross": "0.0769000000000000"
},
"price_monthly": {
"net": "47.9900000000",
"gross": "47.9900000000000000"
}
},
{
"location": "hil",
"price_hourly": {
"net": "0.0769000000",
"gross": "0.0769000000000000"
},
"price_monthly": {
"net": "47.9900000000",
"gross": "47.9900000000000000"
}
}
],
"storage_type": "local",
"cpu_type": "dedicated",
"architecture": "x86",
"included_traffic": 32985348833280,
"deprecation": null
},
"datacenter": {
"id": 3,
"name": "hel1-dc2",
"description": "Helsinki 1 virtual DC 2",
"location": {
"id": 3,
"name": "hel1",
"description": "Helsinki DC Park 1",
"country": "FI",
"city": "Helsinki",
"latitude": 60.169855,
"longitude": 24.938379,
"network_zone": "eu-central"
},
"server_types": {
"supported": [
1,
3,
5,
7,
9,
22,
23,
24,
25,
26,
45,
93,
94,
95,
96,
97,
98,
99,
100,
101
],
"available": [
1,
3,
5,
7,
9,
22,
23,
24,
25,
26,
45,
93,
94,
95,
96,
97,
98,
99,
100,
101
],
"available_for_migration": [
1,
3,
5,
7,
9,
22,
23,
24,
25,
26,
45,
93,
94,
95,
96,
97,
98,
99,
100,
101,
102,
103
]
}
},
"image": {
"id": 148619575,
"type": "snapshot",
"status": "available",
"name": null,
"description": "talos system disk - amd64 - v1.6.3",
"image_size": 0.2891603486328125,
"disk_size": 20,
"created": "2024-02-08T12:26:41+00:00",
"created_from": {
"id": 43135373,
"name": "packer-65c4c7e6-96b2-8b71-a041-16c6cc71e1a0"
},
"bound_to": null,
"os_flavor": "debian",
"os_version": null,
"rapid_deploy": false,
"protection": {
"delete": false
},
"deprecated": null,
"labels": {
"os": "talos",
"arch": "amd64",
"type": "infra",
"version": "v1.6.3"
},
"deleted": null,
"architecture": "x86"
},
"iso": null,
"rescue_enabled": false,
"locked": false,
"backup_window": null,
"outgoing_traffic": 54605000,
"ingoing_traffic": 13552325000,
"included_traffic": 32985348833280,
"protection": {
"delete": false,
"rebuild": false
},
"labels": {
"type": "worker"
},
"volumes": [
100380895
],
"load_balancers": [],
"primary_disk_size": 240,
"placement_group": null
}
],
"meta": {
"pagination": {
"page": 1,
"per_page": 25,
"previous_page": null,
"next_page": null,
"last_page": 1,
"total_entries": 1
}
}
}
E0215 21:44:34.851551 1 node_controller.go:240] error syncing 'talos-worker-2': failed to get node modifiers from cloud provider: provided node ip for node "talos-worker-2" is not valid: failed to get node address from cloud provider that matches ip: 10.0.0.6, requeuing
Additional information
No response
So, in short: it looks like when nodes have ONLY internal IPs from the private Hetzner network, HCCM cannot match them for some reason.
When running the cluster on nodes with only public addresses, there are no issues:
kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-control-plane-1 Ready control-plane 23h v1.29.1 <none> 37.27.38.153 Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-control-plane-2 Ready control-plane 23h v1.29.1 <none> 168.119.189.58 Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-control-plane-3 Ready control-plane 23h v1.29.1 <none> 94.130.150.142 Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-worker-1 Ready <none> 23h v1.29.1 <none> 65.108.90.3 Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
talos-worker-2 Ready <none> 23h v1.29.1 <none> 65.21.152.91 Talos (v1.6.3) 6.1.74-talos containerd://1.7.11
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
cilium-9mmcj 1/1 Running 0 23h
cilium-lr87f 1/1 Running 0 23h
cilium-nn795 1/1 Running 0 23h
cilium-operator-6d6fb6b85f-2n2g6 1/1 Running 0 23h
cilium-operator-6d6fb6b85f-tt5d2 1/1 Running 0 23h
cilium-rp9w6 1/1 Running 0 23h
cilium-xwt47 1/1 Running 0 23h
coredns-85b955d87b-tm47c 1/1 Running 0 23h
coredns-85b955d87b-vx9zg 1/1 Running 0 23h
hcloud-cloud-controller-manager-584f6fc4f4-w6zk2 1/1 Running 0 22h
hcloud-csi-controller-68f987547f-cz9cz 5/5 Running 0 22h
hcloud-csi-node-75pps 3/3 Running 0 22h
hcloud-csi-node-85xlm 3/3 Running 0 22h
hcloud-csi-node-927pf 3/3 Running 0 22h
hcloud-csi-node-9w5sz 3/3 Running 0 22h
hcloud-csi-node-nl94s 3/3 Running 0 22h
kube-apiserver-talos-control-plane-1 1/1 Running 0 23h
kube-apiserver-talos-control-plane-2 1/1 Running 0 23h
kube-apiserver-talos-control-plane-3 1/1 Running 0 23h
kube-controller-manager-talos-control-plane-1 1/1 Running 2 (23h ago) 23h
kube-controller-manager-talos-control-plane-2 1/1 Running 0 23h
kube-controller-manager-talos-control-plane-3 1/1 Running 1 (23h ago) 23h
kube-scheduler-talos-control-plane-1 1/1 Running 2 (23h ago) 23h
kube-scheduler-talos-control-plane-2 1/1 Running 0 23h
kube-scheduler-talos-control-plane-3 1/1 Running 1 (23h ago) 23h
Hey @gecube,
the error happens because hccm reports a different set of addresses for the node than the node currently has. From the error message, API requests (thanks for including them!) and the kubectl output I think the addresses reported are:
- Node (through kubelet):
  - InternalIP: 10.0.0.6
- HCCM:
  - ExternalIP: 65.108.90.3
This causes a conflict, because the library we use (kubernetes/cloud-provider) expects that hccm returns all addresses that are already specified on the node -> No removals allowed.
HCCM only returns the ExternalIP because IPs from a network are only returned if you specify the ID or Name of the network in your configuration. We need this, as a server might be in multiple networks, but only one InternalIP makes sense here.
You can do this by setting the HCLOUD_NETWORK environment variable to the ID or Name of the Network your nodes are attached to.
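As a quick sketch of what that can look like in the Deployment manifest (the network name "cluster-net" is a placeholder, not taken from this report; the secret-based variant from the readme works the same way):

```yaml
# Excerpt from the hcloud-cloud-controller-manager container spec.
# "cluster-net" is a placeholder; use the name or ID of the Hetzner
# Cloud Network your nodes are attached to.
env:
  - name: HCLOUD_NETWORK
    value: "cluster-net"
```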
If you want to run a cluster without public network access, you will need some additional configuration, as your nodes will not be able to pull images or access the Hetzner Cloud API. If you only want your intra-cluster communication to go through the private network, that should be enough.
If you also want to use the routing functionality, you will need to make additional changes to your CNI & the HCCM manifests. See https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/deploy_with_networks.md
@apricote Hi! Thanks for your considerations. So the only reason could be that I forgot HCLOUD_NETWORK? That is a little bit weird, as I am sure that I set it up in the secret... and I don't remember any relevant error messages in the logs. I will run one more experiment to check.
Not sure how you installed HCCM (YAML manifests, Helm chart, ...). But this is the related excerpt from the readme:
If you manage the network yourself it might still be required to let the CCM know about private networks. You can do this by adding the environment variable with the network name/ID in the CCM deployment.
env:
  - name: HCLOUD_NETWORK
    valueFrom:
      secretKeyRef:
        name: hcloud
        key: network

You also need to add the network name/ID to the secret:

kubectl -n kube-system create secret generic hcloud --from-literal=token=<hcloud API token> --from-literal=network=<hcloud Network_ID_or_Name>
As far as I remember there is no error message, as it's an optional configuration value, and nodes may or may not be attached to a network that should be used for in-cluster communication. But maybe the attached network is used for another service, a proxy, etc., so adding a log whenever no network is configured but the node has a network has the potential to spam the logs.
We could add a log that is only emitted once, when no network is configured but a node with a network is processed, and then set some internal variable to "silence" it until the process is restarted.
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.
@apricote @gecube I am facing the exact same issue, and I verified that the environment variables are set correctly.
HCCM version: 1.20.0
Environment variables:
- HCLOUD_NETWORK set correctly
- HCLOUD_NETWORK_ROUTES_ENABLED set to false
- ROBOT_USER and ROBOT_PASSWORD set
- HCLOUD_TOKEN set
- ROBOT_ENABLED set to true
k3s 1.29.5, Cloud Servers with a dedicated server connection over vSwitch.
I manually set the hrobot providerID, but I am still getting:
error syncing 'SERVER_NAME': failed to get node modifiers from cloud provider: provided node ip for node "SERVER_NAME" is not valid: failed to get node address from cloud provider that matches ip: x.x.x.x, requeuing
Any advice on how to fix this?
@paprickar Hi! Thanks for your report. We will check on our side what has changed since the last observations. Also, I want to mention that I am not affiliated with Hetzner in any way; I am an independent engineer.
This was useful for me:
env:
- name: HCLOUD_NETWORK
valueFrom:
secretKeyRef:
name: hcloud
key: network
klin@asus:~/stav$ kgn -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
preprod-talos-controlplane-1 Ready control-plane 10h v1.30.3 10.0.2.2 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18
preprod-talos-controlplane-2 Ready control-plane 10h v1.30.3 10.0.2.3 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18
preprod-talos-controlplane-3 Ready control-plane 10h v1.30.3 10.0.2.4 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18
preprod-talos-ingress-1 Ready <none> 10h v1.30.3 10.0.2.6 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18
preprod-talos-ingress-2 Ready <none> 10h v1.30.3 10.0.2.5 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18
preprod-talos-worker-1 Ready <none> 10h v1.30.3 10.0.2.7 XXX Talos (v1.7.6) 6.6.43-talos containerd://1.7.18