oci-bv - Timed out waiting for backup to become available
Hi,
We are using a VolumeSnapshotClass as below for Block volume snapshotting:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: oci-bv-snapshot-incremental
driver: blockvolume.csi.oraclecloud.com
parameters:
  backupType: incremental # No functional restore difference between full and incremental
deletionPolicy: Delete
This is integrated with CNPG for database volume snapshots in our lower environments. Occasionally (every few weeks), these backups fail with the error:
DeadlineExceeded desc = Timed out waiting for backup to become available
It looks like this is being thrown by the oci-bv CSI driver here: https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1099
That code uses a timeout of 45 seconds, defined here: https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1068
However, in practice a 45-second timeout is too tight. In the logs, we see the following times for snapshot creation in uk-london-1, measured from the com.oraclecloud.BlockVolumes.CreateVolumeBackup.begin event to the com.oraclecloud.BlockVolumes.CreateVolumeBackup.end event:
Over 9 samples: average: 37.4 seconds | min: 34 seconds | max: 41 seconds
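With a backupPollInterval of 5 seconds, a backup that becomes available after 41 seconds may not be observed until the poll at around 45 seconds, so the CSI driver can step just outside the permitted 45-second timeout.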
https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L150C36-L150C60
https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L42
I believe the solution for this would be to increase the available timeout to 60 seconds to align better with the expected response times from the API.
https://github.com/oracle/oci-cloud-controller-manager/blob/411bfeb22d242ba4c1f8b647884f507869968a2f/pkg/csi/driver/bv_controller.go#L1068
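To make the timing argument above concrete, here is a rough sketch of the arithmetic in Python. It is a simplified model and not the driver's actual loop: it assumes polls are issued at exact multiples of the poll interval, and the 0.5-second request latency is a hypothetical figure chosen only for illustration.

```python
import math


def first_observed_at(ready_after_s, poll_interval_s, request_latency_s=0.0):
    """Time at which a poller first sees the backup as available.

    Assumption: polls are issued at t = 0, interval, 2 * interval, ...
    and each poll takes request_latency_s to return.
    """
    polls_needed = math.ceil(ready_after_s / poll_interval_s)
    return polls_needed * poll_interval_s + request_latency_s


def succeeds(ready_after_s, poll_interval_s, timeout_s, request_latency_s=0.0):
    """True if the backup is observed as available before the deadline."""
    observed = first_observed_at(ready_after_s, poll_interval_s, request_latency_s)
    return observed <= timeout_s


if __name__ == "__main__":
    # Min and max creation times reported in this issue.
    for ready_after in (34, 41):
        for timeout in (45, 60):
            ok = succeeds(ready_after, 5, timeout, request_latency_s=0.5)
            print(f"ready after {ready_after}s, timeout {timeout}s: ok={ok}")
```

Under these assumptions, the 34-second case fits inside 45 seconds, while the 41-second case is only observed at about 45.5 seconds and misses the deadline. Both fit inside 60 seconds.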
Thanks!
We are facing the same problem in the same situation.
Same problem here, also using CNPG. However, I believe that 60 seconds won't be enough for us. We have an 8 TB and a 20 TB disk that take longer than that to become available. None of our attempts with snapshots on this database cluster have been successful. The smaller ones are working fine.
We get a similar error:
"rpc error: code = DeadlineExceeded desc = Timed out waiting for backup to become available"
I wonder whether there is any setting on the OCI block volume side to increase the default timeout to a larger value. The snapshot's LIFECYCLE_STATE does change from CREATING to AVAILABLE, but the timeout is shorter than the time that takes.
All these cases should be handled by the CSI's built-in retries: even if CreateSnapshot times out on the first try, it should eventually succeed on a retry, because the retry checks whether the same backup has become available.
Please share logs if this is not the behaviour you are seeing, and I can take a look.
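For readers following along, the retry behaviour described above can be sketched as a small simulation in Python. This is an illustrative model only, not the driver's code: `FakeBackend`, `create_snapshot`, `create_with_retries`, and the 1-second backoff are all hypothetical, and the clock is simulated so nothing actually sleeps. The point it shows is that if each attempt reuses the same backup instead of creating a new one, a slow backup can still succeed on a later attempt.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical stand-in for the block storage API with a simulated clock."""
    ready_after_s: float  # how long a backup takes to become AVAILABLE
    now_s: float = 0.0    # simulated clock, in seconds
    created_at: dict = field(default_factory=dict)  # backup name -> creation time

    def get_or_create(self, name):
        # Idempotent: a repeated call with the same name reuses the backup.
        self.created_at.setdefault(name, self.now_s)

    def state(self, name):
        elapsed = self.now_s - self.created_at[name]
        return "AVAILABLE" if elapsed >= self.ready_after_s else "CREATING"


def create_snapshot(backend, name, timeout_s=45, poll_interval_s=5):
    """One attempt: ensure the backup exists, then poll until the deadline."""
    backend.get_or_create(name)
    deadline = backend.now_s + timeout_s
    while backend.now_s < deadline:
        if backend.state(name) == "AVAILABLE":
            return True
        backend.now_s += poll_interval_s  # simulated sleep between polls
    return False  # plays the role of DeadlineExceeded in this model


def create_with_retries(backend, name, max_attempts=10, backoff_s=1):
    """Retry the same request; return the attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if create_snapshot(backend, name):
            return attempt
        backend.now_s += backoff_s  # hypothetical fixed backoff
    return None


if __name__ == "__main__":
    print(create_with_retries(FakeBackend(ready_after_s=30), "snap-small"))
    print(create_with_retries(FakeBackend(ready_after_s=120), "snap-large"))
```

In this model, a backup that needs 30 seconds succeeds on the first attempt, and one that needs 120 seconds succeeds on the third, because every retry waits on the backup created by the first attempt. Whether the real driver and snapshot controller behave this way for the large volumes reported here is exactly what the requested logs would confirm.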