`Create Elemental Cluster` not working
What steps did you take and what happened:
Hi everyone, I have been struggling for several days to set up an Elemental cluster.
I am using the official Rancher server Helm chart for 2.11.2 with the Elemental plugin. The server-url is set and reachable from the public internet.
I then generate a bare-metal ISO, "SL Micro6.1 ISO v2.2.0-4.3-linux/86_64".
I boot it on Hyper-V (8GB RAM, 80GB disk) with UEFI, Secure Boot, and TPM enabled.
The VM installation starts and the machine shows as "Active" in the "Inventory of Machines". After the automatic restart everything seems fine except for one error log, mixed in with plenty of OK logs: [FAILED] Failed to start elemental health check (is it relevant?).
Then I try "Create Elemental Cluster" on the machine. Plenty of logs are printed, but then it stops, the VM prints "unable to decode an event from the watch stream: INTERNAL_ERROR: received from peer", and the Rancher UI says "waiting for cluster agent to connect".
The cluster then stays stuck in "Updating".
I tried several combinations of OS and cluster versions (RKE2, K3S), but the issue is always the same. Does anyone have any idea?
What did you expect to happen:
The cluster in the provisioned VM should be "Active".
Anything else you would like to add:
Weird log during the VM's first provisioning:
Error logs in the VM and Rancher UI when trying to Create Elemental Cluster:
Environment:
- Elemental release version (use `cat /etc/os-release`): elemental-operator:1.6.9
- Rancher version: 2.11.2
- Kubernetes version (use `kubectl version`): v1.32.5-eks-5d4a308
- Cloud provider or hardware configuration: AWS
Thank you for reporting this. We are trying to understand what is going on. Have you tried using Micro 6.0?
Dear @chargio, thank you very much for the reply.
As you suggested, I tried the SL Micro6.0 ISO v2.1.3-5.4-linux/86_64. It still gets stuck at cluster creation, but this time the error message in the VM is different:
[FATAL] Aborting system-agent installation due to requested strict CA verification with no CA checksum provided
In the Rancher UI I see:
waiting for agent to check in and apply initial plan
I tried with both RKE2 1.30 and 1.32.
I also wonder whether I have a network issue, but I don't see what it could be. The chart only mentions the standard HTTP/HTTPS ports:
https://github.com/rancher/rancher/blob/main/chart/templates/ingress.yaml#L45-L67
https://github.com/rancher/rancher/blob/main/chart/templates/service.yaml#L18-L28
On the Rancher server side I am using an ALB in AWS in front of the ingress. On the VM side I tried both a classical switch and a bridge switch, and several networks (home / phone). The issue is always the same.
I assume that being able to reach the Rancher frontend from my browser validates the whole communication path?
Or does the Elemental VM use additional protocols to reach the Rancher server that I should check?
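In case it helps anyone reproduce the "no CA checksum provided" message, here is a rough Python sketch for checking what CA bundle the server-url actually returns. It assumes the Rancher server exposes its CA bundle at `/cacerts` and that the checksum the agent expects is the SHA-256 of that bundle; both are assumptions about this setup, not something confirmed in this thread. The function names `fetch_cacerts` and `ca_checksum` are made up for illustration.

```python
import hashlib
import ssl
import sys
import urllib.request


def ca_checksum(pem_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of a CA bundle, or "" if the bundle is empty."""
    if not pem_bytes.strip():
        return ""
    return hashlib.sha256(pem_bytes).hexdigest()


def fetch_cacerts(server_url: str, timeout: float = 10.0) -> bytes:
    """Download <server_url>/cacerts without verifying TLS (like `curl --insecure`).

    The /cacerts path is an assumption about the Rancher server.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    url = server_url.rstrip("/") + "/cacerts"
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return resp.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ca.py https://rancher.example.com  (hypothetical URL)
    bundle = fetch_cacerts(sys.argv[1])
    digest = ca_checksum(bundle)
    if digest:
        print("CA bundle SHA-256:", digest)
    else:
        print("Server returned an empty CA bundle, so there is no checksum to compare.")
```

If the bundle comes back empty, that might line up with the agent complaining about a missing checksum, but I have not verified this.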
Is there a customer ticket associated with this? We are having trouble replicating the issue.