oci-cloud-controller-manager

oci-bv - Timed out waiting for backup to become available

Open martysweet opened this issue 11 months ago • 4 comments

Hi,

We are using a VolumeSnapshotClass as below for Block volume snapshotting:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: oci-bv-snapshot-incremental
driver: blockvolume.csi.oraclecloud.com
parameters:
  backupType: incremental # No functional restore difference between full and incremental
deletionPolicy: Delete

This is integrated with CNPG for database volume snapshots in our lower environments. Occasionally (every few weeks), we find these backups failing with the error:

DeadlineExceeded desc = Timed out waiting for backup to become available

It looks like this is being thrown by the oci-bv csi here: https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1099

Which uses a timeout of 45 seconds as defined here: https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1068

However, in practice a 45-second timeout is too short. Looking at the logs, we see the following snapshot creation times in uk-london-1, measured from the com.oraclecloud.BlockVolumes.CreateVolumeBackup.begin event to the com.oraclecloud.BlockVolumes.CreateVolumeBackup.end event:

Over 9 samples: average: 37.4 seconds | min: 34 seconds | max: 41 seconds

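With a backupPollInterval of 5 seconds, the CSI driver can step just past the permitted 45-second timeout, since availability is only observed at the next poll and a backup that completes at around 41 seconds can land right on the deadline. https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L150C36-L150C60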

https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L42
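As a rough illustration of the arithmetic above, the sketch below computes when a backup would first be seen as available under a fixed poll interval, and how much slack remains before the timeout. It is a simplified model and not the driver's code: it assumes polls fire at 0, 5, 10, ... seconds from the start of the call, and the function names `first_observed` and `slack` are hypothetical.

```python
import math


def first_observed(ready_at: float, poll_interval: float) -> float:
    """Time of the first poll at or after the backup becomes available.

    Assumption: polls fire at 0, poll_interval, 2 * poll_interval, ... seconds.
    """
    return math.ceil(ready_at / poll_interval) * poll_interval


def slack(ready_at: float, poll_interval: float, timeout: float) -> float:
    """Seconds left between the observing poll and the timeout (0 or less means no margin)."""
    return timeout - first_observed(ready_at, poll_interval)


# Sample timings reported above: min, average, and max of the 9 samples.
for ready_at in (34, 37.4, 41):
    print(ready_at, first_observed(ready_at, 5), slack(ready_at, 5, 45), slack(ready_at, 5, 60))
```

Under these assumptions the slowest sample (41 seconds) is first observed at 45 seconds, leaving no margin with a 45-second timeout and 15 seconds of margin with a 60-second timeout.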

I believe the solution for this would be to increase the available timeout to 60 seconds to align better with the expected response times from the API. https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1068

Thanks!

martysweet avatar Feb 20 '25 11:02 martysweet

We are facing the same problem in the same situation.

ms3rgio avatar Feb 24 '25 20:02 ms3rgio

Same problem here, also using CNPG. However, I believe that 60 seconds won't be enough for us. We have an 8 TB disk and a 20 TB disk that take longer than that to become available. None of our snapshot attempts on this database cluster have been successful. The smaller ones are working fine.

silvio89 avatar Feb 24 '25 20:02 silvio89

We get a similar error: "rpc error: code = DeadlineExceeded desc = Timed out waiting for backup to become available"

I wonder if there is any setting on the OCI block volume side to increase the default timeout to a larger value. The snapshot's LIFECYCLE_STATE does change from CREATING to AVAILABLE, but the timeout is shorter than the time that takes.

santoshr1016 avatar Sep 04 '25 10:09 santoshr1016

All of these cases should be handled by the CSI's built-in retries: even if CreateSnapshot times out on the first try, it should eventually succeed on a retry, because the retry checks whether the same backup has become available.

Please share logs if this is not the behaviour you are seeing, and I can take a look.

YashasG98 avatar Sep 19 '25 05:09 YashasG98
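The retry behaviour described in the last comment can be sketched with a toy simulation. This is only a model under stated assumptions, not the driver or the OCI SDK: `FakeBackupService`, `create_snapshot`, and `attempts_needed` are hypothetical names, time is simulated rather than slept, retries are assumed to start immediately with no backoff, and the 600-second figure is an arbitrary stand-in for a large volume.

```python
class FakeBackupService:
    """Toy stand-in for a backup API (hypothetical, not the OCI SDK)."""

    def __init__(self, seconds_until_available):
        self.seconds_until_available = seconds_until_available
        self.created_at = {}  # backup name -> simulated creation time

    def get_or_create(self, name, now):
        # Idempotent: a repeated request for the same name reuses the backup.
        self.created_at.setdefault(name, now)

    def is_available(self, name, now):
        return now - self.created_at[name] >= self.seconds_until_available


def create_snapshot(service, name, now, timeout=45, poll_interval=5):
    """One attempt. Returns (succeeded, simulated time when the call returned)."""
    service.get_or_create(name, now)
    deadline = now + timeout
    while now < deadline:
        if service.is_available(name, now):
            return True, now
        now += poll_interval
    return False, now  # models the DeadlineExceeded case


def attempts_needed(seconds_until_available, max_attempts=1000):
    """Count attempts until one succeeds, retrying immediately after each timeout."""
    service = FakeBackupService(seconds_until_available)
    now = 0
    for attempt in range(1, max_attempts + 1):
        succeeded, now = create_snapshot(service, "snap-1", now)
        if succeeded:
            return attempt
    raise RuntimeError("backup never became available")


print(attempts_needed(41))   # backup ready just after the last poll of attempt 1
print(attempts_needed(600))  # hypothetical slow backup of a large volume
```

In this model a backup that needs 41 seconds fails the first attempt and succeeds on the second, because the second call finds the same backup already available. A much slower backup still succeeds eventually, but only after many timed-out attempts, each of which would surface as a DeadlineExceeded error.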