VM HA: recreate VM with local storage gets stuck at CLEANUP_RESUBMIT state
Description
I tried to configure VM HA with a VM running on local storage. The hook is configured with the recreate ("-r") option; the setup follows the documentation here: https://docs.opennebula.io/7.0/product/control_plane_configuration/high_availability/vm_ha/ After an error is produced on the host, the hook is correctly triggered, the configured fencing mechanism shuts down the failed host, and the hook logs indicate that the recreate action was submitted. However, the VM gets stuck in the CLEANUP_RESUBMIT state instead of being recreated on a new host.
VM state while stuck (CLEANUP_RESUBMIT, shown as "clea"):
ID USER GROUP NAME STAT CPU MEM HOST TIME
124 oneadmin oneadmin Alpine Linux 3.21-124 clea 1 256M ubuntu2404-kvm-ssh-7-0-bf20-1.test 0d 02h25
123 oneadmin oneadmin Alpine Linux 3.21-123 runn 1 256M ubuntu2404-kvm-ssh-7-0-bf20-2.test 1d 23h11
Hook logs show successful fencing, the recreate submission, and hook completion:
[2025-07-09 15:04:05 +0000][HOST 0][I] Fencing success
[2025-07-09 15:04:05 +0000][HOST 0][I] states: 3
[2025-07-09 15:04:05 +0000][HOST 0][I] vms: ["124"]
[2025-07-09 15:04:05 +0000][HOST 0][I] recreate 124
[2025-07-09 15:04:05 +0000][HOST 0][I] Hook finished
The VM logs show that a driver command for the VM was cancelled, but I am not sure how to find more information about why it was cancelled and which driver was involved:
Wed Jul 9 14:52:40 2025 [Z0][VM][I]: New LCM state is RUNNING
Wed Jul 9 14:59:03 2025 [Z0][VM][I]: New LCM state is UNKNOWN
Wed Jul 9 15:04:05 2025 [Z0][VM][I]: New state is ACTIVE
Wed Jul 9 15:04:05 2025 [Z0][VM][I]: New LCM state is CLEANUP_RESUBMIT
Wed Jul 9 15:04:05 2025 [Z0][VMM][I]: Driver command for 124 cancelled
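More context on cancelled driver commands can usually be found in the front-end logs; with a default packaged install the per-VM log is /var/log/one/<vm_id>.log and the daemon log is /var/log/one/oned.log (paths may differ on other installs). A minimal filter sketch, using the excerpt above as simulated input so it runs standalone:

```shell
# Sketch: filter virtual machine manager (VMM) messages for the affected VM.
# Simulated here with the log line from this report; on a live front-end the
# input would instead be /var/log/one/124.log (and oned.log for daemon-side
# context around the same timestamp).
vm_log='Wed Jul 9 15:04:05 2025 [Z0][VMM][I]: Driver command for 124 cancelled'
echo "$vm_log" | grep '\[VMM\]'
```

Correlating the `[VMM]` entries with oned.log around 15:04:05 may reveal which driver action was in flight when the cancellation happened.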
To Reproduce
Steps to reproduce the behavior:
- Configure VM HA, following: https://docs.opennebula.io/7.0/product/control_plane_configuration/high_availability/vm_ha/
- Configure the hook with recreate "-r" option
- Deploy a VM running on local storage (the System Datastore is local)
- Force an error state on the host where the VM is running (e.g. bring down the interface the front-end uses to reach it)
- Wait until the error state appears, or force an update with onehost forceupdate <ID>
- Verify that the hook is triggered and that fencing successfully shuts down the host in the error state
- Verify that the VM stays in the "clea" (CLEANUP_RESUBMIT) state.
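The last verification step can be scripted; a minimal sketch, using the sample "onevm list" row from this report as input so it is self-contained (on a live front-end the real command output would be piped in instead):

```shell
# Sketch: confirm the VM is stuck by checking the STAT column ("clea" is the
# short form of CLEANUP_RESUBMIT in "onevm list" output). The sample row is
# taken from this report, with the NAME compacted to keep one field per column.
sample='124 oneadmin oneadmin Alpine-Linux-3.21-124 clea 1 256M host-1 0d02h25'
state=$(echo "$sample" | awk '{print $5}')
if [ "$state" = "clea" ]; then
  echo "VM stuck in CLEANUP_RESUBMIT"
fi
```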
Expected behavior
After the recreate, the VM is expected to be scheduled on another available host and brought back to the running state.
Note: in this case, understandably, all VM state is lost, as its local disk has been cleaned up on the failed host. This is expected, and this type of VM HA only makes sense for stateless VMs; VM migration requires shared storage.
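The expected end state could be checked with a simple poll. A sketch under stated assumptions: `get_state` is a hypothetical stub standing in for a real query of the VM's LCM state, and the retry count and interval are arbitrary:

```shell
# Hypothetical stub; on a live front-end this might wrap something like
#   onevm show "$1" | grep LCM_STATE
# Here it simply returns RUNNING so the example is runnable standalone.
get_state() {
  echo "RUNNING"
}

# Poll until the VM reaches RUNNING or the retries are exhausted.
wait_until_running() {
  for _ in 1 2 3; do
    s=$(get_state "$1")
    if [ "$s" = "RUNNING" ]; then
      echo "VM $1 reached RUNNING"
      return 0
    fi
    sleep 5
  done
  echo "VM $1 still in state $s"
  return 1
}

wait_until_running 124
```

In the reported failure this poll would never succeed: the VM remains in CLEANUP_RESUBMIT instead of transitioning back to RUNNING on a new host.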
Details
- Affected Component: Storage, Hooks
- Hypervisor: KVM
- Version: 7.0
Progress Status
- [ ] Code committed
- [ ] Testing - QA
- [ ] Documentation (Release notes - resolved issues, compatibility, known issues)