
[Bug] - Proxmox autoscale keeps deleting new VM

Open vilitik opened this issue 8 months ago • 1 comment

Existing Resources

  • [ x ] Please search the existing issues for related problems
  • [ x ] Consult the product documentation : Docs
  • [ x ] Consult the FAQ : FAQ
  • [ x ] Consult the Troubleshooting Guide : Guide
  • [ x ] Reviewed existing training videos: Youtube

Describe the bug Autoscale correctly creates a new VM, but then deletes it after ~20 seconds, and this loop keeps repeating. The logs show the reason for deletion (10.10.13.2 is the Proxmox host): [screenshot]

What's even more confusing, the startup script actually gets run on the VM. If I quickly delete the VM's tags in Proxmox, Kasm can't delete the VM and I can RDP in to diagnose it. I used this startup script: https://github.com/kasmtech/workspaces-autoscale-startup-scripts/blob/develop/latest/windows_vms/default_kasm_desktop_service_startup_script.txt [screenshot]

It's obvious that Windows can't boot and install the Kasm desktop service within 20 seconds, but I'm at a total dead end on how to make this timeout longer. I have tried changing every timeout setting I can find in the Kasm settings, with no success.

To Reproduce On a clean Kasm installation, I followed the docs and this video https://www.youtube.com/watch?v=nXIBGs_WJcs to set up autoscaling.

Workspaces Version Tested 1.17.0.7f020d and 1.17.0.94d3c9

Workspaces Installation Method Single Server

Workspace Server Information (please provide the output of the following commands):

  • uname -a Linux kasm2 6.8.0-60-generic #63-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 19:04:15 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • cat /etc/os-release

    PRETTY_NAME="Ubuntu 24.04.2 LTS"
    NAME="Ubuntu"
    VERSION_ID="24.04"
    VERSION="24.04.2 LTS (Noble Numbat)"
    VERSION_CODENAME=noble
    ID=ubuntu
    ID_LIKE=debian
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    UBUNTU_CODENAME=noble
    LOGO=ubuntu-logo
  • sudo docker info

Client: Docker Engine - Community
 Version: 28.1.1
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.23.0
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.5.0
   Path: /usr/local/lib/docker/cli-plugins/docker-compose

Server:
 Containers: 10
  Running: 10
  Paused: 0
  Stopped: 0
 Images: 18
 Server Version: 28.1.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan kasmweb/sidecar:1.2 macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-60-generic
 Operating System: Ubuntu 24.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.62GiB
 Name: kasm2
 ID: fcac6a68-192d-4853-85b0-e0287017474b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

  • sudo docker ps | grep kasm

    df682befaa30 kasmweb/proxy:1.17.0 "/docker-entrypoint.…" About an hour ago Up About an hour 80/tcp, 0.0.0.0:443->443/tcp, [::]:443->443/tcp kasm_proxy
    36de6d9799d2 kasmweb/rdp-https-gateway:1.17.0 "/opt/rdpgw/rdpgw" About an hour ago Up About an hour (healthy) kasm_rdp_https_gateway
    6659472b9dc9 kasmweb/share:1.17.0 "python3 /src/api_se…" About an hour ago Up About an hour (healthy) 8182/tcp kasm_share
    809ee0edb04b kasmweb/rdp-gateway:1.17.0 "/start.sh" About an hour ago Up About an hour (healthy) 0.0.0.0:3389->3389/tcp, [::]:3389->3389/tcp kasm_rdp_gateway
    a9b819abe211 kasmweb/agent:1.17.0 "python3 /src/Provis…" About an hour ago Up About an hour (healthy) 4444/tcp kasm_agent
    0a333136598d kasmweb/kasm-guac:1.17.0 "/dockerentrypoint.sh" About an hour ago Up About an hour (healthy) kasm_guac
    c2fbb6f36186 kasmweb/manager:1.17.0 "python3 /src/api_se…" About an hour ago Up About an hour (healthy) 8181/tcp kasm_manager
    32b6f5852950 kasmweb/api:1.17.0 "/bin/sh -c /usr/bin…" About an hour ago Up About an hour (healthy) 8080/tcp kasm_api
    2712e7c39f17 redis:5-alpine "docker-entrypoint.s…" About an hour ago Up About an hour 6379/tcp kasm_redis
    c4dfd260c38b kasmweb/postgres:1.17.0 "docker-entrypoint.s…" About an hour ago Up About an hour (healthy) 5432/tcp kasm_db


vilitik · May 18 '25 18:05

I think I found the root cause of this issue. Digging into the Proxmox logs, I believe the guest-exec timeout is too short. When my host is under low load, it successfully creates new servers, but when load increases the looping issue occurs and the following log message appears in Proxmox: VM 2000 qmp command failed - VM 2000 qmp command 'guest-ping' failed - got timeout

guest-ping obviously fails because Windows (and the guest agent) haven't started yet. My recommendation is to set the timeout to something much longer than 5 seconds (or perhaps even allow a custom timeout value in the GUI?).
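The "wait longer" idea above can be sketched as a deadline-based poll rather than a single short-timeout ping. This is a hypothetical illustration, not Kasm's actual code: `ping` stands in for whatever call checks the QEMU guest agent (e.g. Proxmox's guest-ping), and the parameter names are made up.

```python
import time

def wait_for_guest_agent(ping, deadline_s=300, interval_s=5):
    """Poll a guest-agent ping callable until it answers or the deadline passes.

    `ping` is any zero-argument callable returning True once the agent
    responds (a stand-in for a Proxmox guest-ping API call). A Windows VM
    may need minutes to boot, so the deadline is generous by default.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if ping():
            return True           # agent is up; safe to run guest-exec now
        time.sleep(interval_s)    # back off instead of hammering the host
    return False                  # only give up after the full deadline
```

With this shape, a single failed guest-ping during boot no longer decides the VM's fate; only the overall deadline does.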

vilitik · May 19 '25 16:05

We have made updates to Kasm Workspaces that are included in 1.17.0-rolling service containers that help mitigate this. After digging into both our code and the proxmox code we found two issues that resulted in similar behavior:

  • Kasm was not setting the timeout for Proxmox calls. This resulted in calls to Proxmox timing out before the command could complete on the host, producing the first error type seen here: "HTTPSConnectionPool(host='10.10.13.2', port=8006): Read timed out. (read timeout=5)"
  • We believe Proxmox may have an intermittent bug when running commands on a host, usually under higher resource stress. It pertains specifically to commands that execute scripts and seems to be masked by the timeout logic of the Proxmox agent. This produces the second error type seen in this bug, similar to "VM 2000 qmp command failed - VM 2000 qmp command 'guest-ping' failed - got timeout"
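The first error above is the classic Python `requests` read-timeout message, and the fix for that class of failure is to pass an explicit, longer timeout on each REST call to the Proxmox API. The sketch below is illustrative only, not Kasm's patch: the endpoint path follows the public Proxmox VE API (`/nodes/{node}/qemu/{vmid}/agent/exec`), while the function name, token handling, and default values are assumptions.

```python
import requests

PROXMOX = "https://10.10.13.2:8006/api2/json"  # host from the logs above

def guest_exec(node, vmid, command, token, read_timeout=60):
    """Run a command via the QEMU guest agent with an explicit read timeout.

    A short 5-second read timeout (as seen in the log) can expire before
    guest-exec finishes on a busy host; passing a longer one per call
    avoids the "Read timed out. (read timeout=5)" failure mode.
    """
    resp = requests.post(
        f"{PROXMOX}/nodes/{node}/qemu/{vmid}/agent/exec",
        headers={"Authorization": f"PVEAPIToken={token}"},
        data={"command": command},
        # (connect timeout, read timeout): give the agent time to answer
        timeout=(10, read_timeout),
        verify=False,  # illustrative only; verify certificates in production
    )
    resp.raise_for_status()
    return resp.json()
```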

Our patch fixes the first issue. We have also added retry logic to mitigate the second issue, but because it lies in Proxmox's domain it can still occur. Our team opened a ticket with Proxmox to track the problem: https://bugzilla.proxmox.com/show_bug.cgi?id=6457
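The retry mitigation described above can be sketched generically. Again this is a hypothetical shape, not the shipped code: `qmp_call` stands in for any flaky QMP-style operation, and Python's built-in `TimeoutError` stands in for the "got timeout" QMP failure.

```python
import time

def retry_on_timeout(qmp_call, attempts=3, backoff_s=2.0):
    """Retry a flaky QMP-style call, backing off between attempts.

    Only timeout-shaped failures are retried; any other exception is a
    real error and propagates immediately.
    """
    for attempt in range(attempts):
        try:
            return qmp_call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                              # out of retries: surface it
            time.sleep(backoff_s * (attempt + 1))  # linear backoff
```

Retrying helps because the Proxmox-side failure is intermittent; a second attempt a few seconds later usually succeeds, which matches the "can still occur" caveat above.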

Here is a knowledge base article for updating your deployment to use the rolling Kasm Workspaces service containers: https://kasmweb.atlassian.net/servicedesk/customer/portal/3/article/9240595. Follow the instructions under "Update Kasm services containers".

rickkoliser · Jun 12 '25 21:06