CKS: cluster does not fully deploy due to sshd not started on control node

Open DaanHoogland opened this issue 1 year ago • 17 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME
CKS
CLOUDSTACK VERSION
4.19.0
CONFIGURATION

simple installation with CKS enabled

OS / ENVIRONMENT

4.19 with any hypervisor/network model

SUMMARY

When starting a CKS cluster, the control node does not enable SSH, so the cluster never comes up.

STEPS TO REPRODUCE
  • Enable CKS.
  • Install a CKS image.
  • Deploy a cluster.
EXPECTED RESULTS
The cluster comes up (control node accessible via SSH).
ACTUAL RESULTS
No SSH daemon is active on the control node, so the deployment times out.

DaanHoogland avatar May 24 '24 09:05 DaanHoogland

@DaanHoogland can you share the hypervisor type and version, and the CKS ISO link?

weizhouapache avatar May 24 '24 09:05 weizhouapache

The CKS version I tried is 1.27.8, but the user reported trying several versions. The hypervisor is VMware; I will verify others and update the description. I am first checking 4.18.1 (and possibly earlier) to see when it was introduced.

DaanHoogland avatar May 24 '24 09:05 DaanHoogland

And the network type, etc.

In some setups, the public IP of the CKS cluster must be accessible from the CloudStack management server. If the management server (in a private network) cannot reach the CKS nodes via the public IP to get the status of the CKS cluster, the cluster might end up in the Error state.
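
A quick way to rule out the connectivity case from the management server is a plain TCP probe of the SSH port on the cluster's public IP. This is only a minimal sketch: the function name is hypothetical, and the default port 2222 is an assumption about how SSH to the control node is forwarded in a typical CKS setup, so adjust it to whatever your port forwarding rules show.

```python
import socket


def ssh_port_reachable(host: str, port: int = 2222, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default port 2222 is an assumption, not something confirmed in
    this thread; pass the port your forwarding rule actually uses.
    """
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the context manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it on the management server, for example `ssh_port_reachable("<cluster public IP>")`. A `False` result points at routing, firewall, or a missing sshd on the node; a `True` result only shows that something accepts TCP connections on that port.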

weizhouapache avatar May 24 '24 09:05 weizhouapache

@weizhouapache In the tests I did to recreate the problem, connectivity between the public IP of the CKS cluster and the management servers is enabled, and the problem occurs anyway. Also, if I access one of the nodes via the console and start the SSH service manually, I can establish an SSH connection from the management servers.

luganofer avatar May 24 '24 14:05 luganofer

OK @luganofer, did it happen every time, or just once?

weizhouapache avatar May 24 '24 15:05 weizhouapache

@weizhouapache At least in my lab environment it happens in every deployment I try and with several different k8s versions (1.28.4, 1.27.8, 1.27.3, 1.26.6)

luganofer avatar May 24 '24 16:05 luganofer

@luganofer can you also share the hypervisor type and version, and the link to the CKS ISO?

weizhouapache avatar May 24 '24 21:05 weizhouapache

@weizhouapache I am using VMware vSphere 8.0c and all the ISOs were downloaded from the following link: https://download.cloudstack.org/cks/

luganofer avatar May 24 '24 22:05 luganofer

@luganofer As the nodes / VMs come up, do you see any error logs in the VM console?

Pearl1594 avatar Jun 03 '24 20:06 Pearl1594

Hi @Pearl1594, there are no error logs in the VM console.
Only the following error is observed in the management server logs:

ERROR [c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-84:ctx-216a9879 job-150679 ctx-7f68a95d) (logid:e493dc8f) Failed to setup Kubernetes cluster : maradona in usable state as unable to access control node VMs of the cluster

From my perspective, the problem is that the nodes do not initialise correctly (cloud-init?). They receive an IP via DHCP from the VR, but they do not change the hostname and, more fundamentally, do not start the SSH service. As a result, the ACS management servers cannot reach the deployed nodes via SSH, and the deployment of the k8s cluster never completes.

luganofer avatar Jun 05 '24 02:06 luganofer

@luganofer If you are able to log into the VM (is the password still "password"?) and restart SSH, can you check the cloud-init logs in /var/log/cloud-init-*?

Can you also double-check the VMware version? Is it 8.0c, 8.0 update 1c, or 8.0 update 2c?
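
If reading the logs by hand on the node console is awkward, something like the following can narrow them down. It is a rough sketch, not an official tool: the function names are hypothetical, the keyword list is a guess at what is worth looking at, and the default glob `/var/log/cloud-init*` is deliberately a little broader than the path above so that it also picks up `cloud-init.log`.

```python
import glob
import re

# Keywords that usually mark a problem in cloud-init logs (an assumption;
# extend the list if your logs use other markers).
PROBLEM = re.compile(r"\b(ERROR|CRITICAL|WARNING|Traceback)\b")


def find_problem_lines(lines):
    """Yield (line_number, text) for lines that look like warnings or errors."""
    for number, line in enumerate(lines, start=1):
        if PROBLEM.search(line):
            yield number, line.rstrip("\n")


def scan_cloud_init_logs(pattern="/var/log/cloud-init*"):
    """Return {path: [(line_number, text), ...]} for every file matching pattern."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            results[path] = list(find_problem_lines(handle))
    return results
```

Running `scan_cloud_init_logs()` on the node (reading `/var/log` may need root) returns the suspicious lines per file. An empty result for a file is also informative, because it can mean cloud-init never got far enough to log a failure.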

weizhouapache avatar Jun 05 '24 05:06 weizhouapache

Sorry, I forgot to report back: XCP-ng and KVM seem to work; only VMware is broken.

DaanHoogland avatar Jun 05 '24 07:06 DaanHoogland

Which VMware version did you test, @DaanHoogland? It seems to be working in the Trillian tests.

weizhouapache avatar Jun 05 '24 08:06 weizhouapache

Which test verifies this, @weizhouapache? (As I recall it was 70u3, but I'll check.)

DaanHoogland avatar Jun 05 '24 09:06 DaanHoogland

It was 80u1, @weizhouapache.

DaanHoogland avatar Jun 05 '24 09:06 DaanHoogland

  • 80u1 (8.0.1.0) is not working; see #7572. Do not run 4.18/4.19 tests with it. However, 4.20 seems to work with 80u1.

  • We use 8.0b (8.0.0.2) in Trillian tests with vmware-80. They have been run many times, and the test results look good.

  • The reporter uses 8.0c (8.0.0.3, if the version is correct). Maybe we can upgrade the Trillian VM template from 8.0b to 8.0c and run some tests, @DaanHoogland.

weizhouapache avatar Jun 05 '24 12:06 weizhouapache

@DaanHoogland there is a known issue where a systemvm/CKS node is stuck at Starting on VMware 80u1: https://github.com/apache/cloudstack/issues/7572. Will you move this to the 4.20.0.0 milestone and test it later?

@sureshanaparti is working on vmware 80u1/u2/u3 support in 4.20.0.0

weizhouapache avatar Aug 22 '24 11:08 weizhouapache

If this issue happens with VMware 8.0u1/u2/u3, it should have been addressed by #9625.

cc @DaanHoogland @rohityadavcloud @sureshanaparti @JoaoJandre

weizhouapache avatar Sep 10 '24 17:09 weizhouapache

@weizhouapache I do not have a VMware 8 env to test this.

Could someone validate if the issue persists after #9625? cc @DaanHoogland @rohityadavcloud @sureshanaparti

JoaoJandre avatar Sep 16 '24 19:09 JoaoJandre

Tested on both 8.0u2 and 8.0u3; both clusters are marked as running, so I think it is safe to assume this is solved.

DaanHoogland avatar Sep 17 '24 06:09 DaanHoogland