VM HA: recreate VM with local storage gets stuck at CLEANUP_RESUBMIT state

Open balazsbme opened this issue 6 months ago • 0 comments

Description

I tried to configure VM HA for a VM that runs on local storage. The hook is configured with the recreate ("-r") option, as documented here: https://docs.opennebula.io/7.0/product/control_plane_configuration/high_availability/vm_ha/ After I force a host error, the hook is triggered correctly, the configured fencing mechanism shuts down the failed host, and the hook logs indicate that the recreate action is submitted. However, the VM gets stuck in the CLEANUP_RESUBMIT state instead of being recreated on a new host.

VM state while stuck in CLEANUP_RESUBMIT (shown as "clea" in the STAT column):

ID USER     GROUP    NAME                                                                   STAT  CPU     MEM HOST                                                  TIME
 124 oneadmin oneadmin Alpine Linux 3.21-124                                                  clea    1    256M ubuntu2404-kvm-ssh-7-0-bf20-1.test                0d 02h25
 123 oneadmin oneadmin Alpine Linux 3.21-123                                                  runn    1    256M ubuntu2404-kvm-ssh-7-0-bf20-2.test                1d 23h11
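
To pick out VMs in this state from a longer listing, a minimal Python sketch follows. It assumes the column layout shown above (ID, USER, GROUP, NAME, STAT, CPU, MEM, HOST, TIME) and that every row has a HOST value; rows without one would not parse correctly. The helper names `parse_vm_rows` and `stuck_in_cleanup` are hypothetical and not part of OpenNebula.

```python
# Sample rows copied from the listing above (column padding shortened).
SAMPLE_LIST = """\
ID USER     GROUP    NAME                  STAT  CPU  MEM HOST                               TIME
 124 oneadmin oneadmin Alpine Linux 3.21-124 clea    1 256M ubuntu2404-kvm-ssh-7-0-bf20-1.test 0d 02h25
 123 oneadmin oneadmin Alpine Linux 3.21-123 runn    1 256M ubuntu2404-kvm-ssh-7-0-bf20-2.test 1d 23h11
"""


def parse_vm_rows(text):
    """Parse VM rows, assuming NAME may contain spaces and HOST is always set."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header and anything too short to hold all columns.
        if len(tokens) < 10 or not tokens[0].isdigit():
            continue
        rows.append({
            "id": int(tokens[0]),
            # NAME is whatever sits between GROUP and the last six tokens
            # (STAT, CPU, MEM, HOST, and the two-token TIME).
            "name": " ".join(tokens[3:-6]),
            "stat": tokens[-6],
            "host": tokens[-3],
        })
    return rows


def stuck_in_cleanup(text):
    """Return the rows whose short state is 'clea'."""
    return [row for row in parse_vm_rows(text) if row["stat"] == "clea"]


if __name__ == "__main__":
    for row in stuck_in_cleanup(SAMPLE_LIST):
        print(row["id"], row["name"], row["host"])
```

Parsing from the right-hand side of each row is a deliberate choice here, since the VM name is the only column in this sample that contains spaces.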

Hook logs show successful fencing, recreation and hook finish:

[2025-07-09 15:04:05 +0000][HOST 0][I] Fencing success
[2025-07-09 15:04:05 +0000][HOST 0][I] states: 3
[2025-07-09 15:04:05 +0000][HOST 0][I] vms: ["124"]
[2025-07-09 15:04:05 +0000][HOST 0][I] recreate 124
[2025-07-09 15:04:05 +0000][HOST 0][I] Hook finished
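
To check which steps the hook reported, the log lines above can be summarized with a short Python sketch. The line pattern is inferred only from the five lines shown here, so treat it as an assumption about the hook log format; `summarize_hook_log` is a hypothetical helper name.

```python
import re

# Sample copied from the hook log above.
SAMPLE_HOOK_LOG = """\
[2025-07-09 15:04:05 +0000][HOST 0][I] Fencing success
[2025-07-09 15:04:05 +0000][HOST 0][I] states: 3
[2025-07-09 15:04:05 +0000][HOST 0][I] vms: ["124"]
[2025-07-09 15:04:05 +0000][HOST 0][I] recreate 124
[2025-07-09 15:04:05 +0000][HOST 0][I] Hook finished
"""

# [timestamp][HOST <id>][<level>] <message>  (format assumed from the sample)
HOOK_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[HOST (?P<host>\d+)\]\[(?P<level>\w)\] (?P<msg>.*)$"
)


def summarize_hook_log(text):
    """Collect the fencing result, recreated VM IDs, and hook completion."""
    summary = {"host": None, "fenced": False, "recreated": [], "finished": False}
    for line in text.splitlines():
        match = HOOK_LINE.match(line.strip())
        if not match:
            continue
        summary["host"] = int(match["host"])
        msg = match["msg"]
        if msg == "Fencing success":
            summary["fenced"] = True
        elif msg.startswith("recreate "):
            summary["recreated"].append(int(msg.split()[1]))
        elif msg == "Hook finished":
            summary["finished"] = True
    return summary


if __name__ == "__main__":
    print(summarize_hook_log(SAMPLE_HOOK_LOG))
```

A summary with `fenced` and `finished` both true and VM 124 in `recreated` matches what is reported here: the hook itself completes, and the problem appears afterwards.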

The VM logs show that the driver command for the VM is cancelled, but I am not sure how to find out which driver command was cancelled or why:

Wed Jul  9 14:52:40 2025 [Z0][VM][I]: New LCM state is RUNNING
Wed Jul  9 14:59:03 2025 [Z0][VM][I]: New LCM state is UNKNOWN
Wed Jul  9 15:04:05 2025 [Z0][VM][I]: New state is ACTIVE
Wed Jul  9 15:04:05 2025 [Z0][VM][I]: New LCM state is CLEANUP_RESUBMIT
Wed Jul  9 15:04:05 2025 [Z0][VMM][I]: Driver command for 124 cancelled
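
The sequence in the VM log can be checked programmatically with a minimal Python sketch. It assumes the line format seen in the excerpt above, and the helper names `lcm_states` and `cancelled_in_cleanup` are hypothetical. It only confirms that a driver cancellation was logged while the VM was in CLEANUP_RESUBMIT; it does not explain the cause.

```python
import re

# Sample copied from the VM log above.
SAMPLE_VM_LOG = """\
Wed Jul  9 14:52:40 2025 [Z0][VM][I]: New LCM state is RUNNING
Wed Jul  9 14:59:03 2025 [Z0][VM][I]: New LCM state is UNKNOWN
Wed Jul  9 15:04:05 2025 [Z0][VM][I]: New state is ACTIVE
Wed Jul  9 15:04:05 2025 [Z0][VM][I]: New LCM state is CLEANUP_RESUBMIT
Wed Jul  9 15:04:05 2025 [Z0][VMM][I]: Driver command for 124 cancelled
"""

# ... [Z<zone>][<module>][<level>]: <message>  (format assumed from the sample)
VM_LINE = re.compile(r"\[Z\d+\]\[(?P<module>\w+)\]\[(?P<level>\w)\]: (?P<msg>.*)$")
LCM_PREFIX = "New LCM state is "


def lcm_states(text):
    """Return the LCM states in the order they were logged."""
    states = []
    for line in text.splitlines():
        match = VM_LINE.search(line)
        if match and match["msg"].startswith(LCM_PREFIX):
            states.append(match["msg"][len(LCM_PREFIX):].strip())
    return states


def cancelled_in_cleanup(text):
    """True if a VMM 'cancelled' message appears while in CLEANUP_RESUBMIT."""
    last_state = None
    for line in text.splitlines():
        match = VM_LINE.search(line)
        if not match:
            continue
        msg = match["msg"]
        if msg.startswith(LCM_PREFIX):
            last_state = msg[len(LCM_PREFIX):].strip()
        elif (match["module"] == "VMM" and msg.endswith("cancelled")
              and last_state == "CLEANUP_RESUBMIT"):
            return True
    return False


if __name__ == "__main__":
    print(lcm_states(SAMPLE_VM_LOG))
    print(cancelled_in_cleanup(SAMPLE_VM_LOG))
```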

To Reproduce

Steps to reproduce the behavior:

Configure VM HA, following: https://docs.opennebula.io/7.0/product/control_plane_configuration/high_availability/vm_ha/
Configure the hook with the recreate ("-r") option
Deploy a VM that runs on local storage (the System DS is local)
Force an error state on the host where the VM is running (e.g. bring down the interface that the FE uses to reach it)
Wait until the error state appears, or force an update with onehost forceupdate <ID>
Verify that the hook is triggered and that fencing successfully shuts down the host in the error state
Verify that the VM stays in "clea" = CLEANUP_RESUBMIT state.

Expected behavior

After the recreate action, the VM should be scheduled on a different, available host and brought up to the running state.

Note: in this case, understandably, all state in the VM is lost, as its local disk has been cleaned up on the failed host. This is expected, and this type of VM HA only makes sense for stateless VMs. For VM migration, shared storage is necessary.

Details

Affected Component: Storage, Hooks
Hypervisor: KVM
Version: 7.0

Progress Status

[ ] Code committed
[ ] Testing - QA
[ ] Documentation (Release notes - resolved issues, compatibility, known issues)

Jul 10 '25 08:07 balazsbme