CAPI waiting forever for the volume to be detached
What steps did you take and what happened:
We have a use case where we run a DaemonSet that mounts a volume. When draining a node, CAPI does not touch DaemonSet pods. Due to https://github.com/kubernetes-sigs/cluster-api/pull/4945, CAPI then waits for the volumes to be detached, which is normally not an issue since all pods are deleted during the drain. But since the volumes are mounted by the DaemonSet pods, which keep running, they are never unmounted, so CAPI ends up waiting forever for the volumes to be detached.
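For reference, a minimal sketch of the kind of workload that triggers this (the names, image, and the PVC-backed volume are illustrative, not our actual manifests):

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDaemonSet builds a DaemonSet whose pod mounts a PersistentVolumeClaim.
// The pod survives a node drain (daemon sets are ignored), so the volume it
// mounts is never unmounted and the node's VolumesAttached list never empties.
func exampleDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"app": "example-ds"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "example-ds", Namespace: "default"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:         "worker",
						Image:        "busybox",
						Command:      []string{"sleep", "infinity"},
						VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
					}},
					Volumes: []corev1.Volume{{
						Name: "data",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
								ClaimName: "example-pvc",
							},
						},
					}},
				},
			},
		},
	}
}
```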
What did you expect to happen: the node to be drained and the DaemonSet's volume to be detached successfully, without a deadlock.
Environment:
- Cluster-api version: v0.4.4
- Minikube/KIND version: NA
- Kubernetes version: (use kubectl version): 1.23.3
- OS (e.g. from /etc/os-release): SLES 15SP2
/kind bug
/cc @sbueringer @fabriziopandini @vincepri
I think this happens because we drain with IgnoreAllDaemonSets hard-coded to true: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller.go#L532
I don't think we can just change the hard-coded value to false; maybe we should make it configurable.
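For reference, roughly how the drain is set up today (a simplified sketch, not the exact controller code; the linked line is the authoritative source):

```go
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	kubedrain "k8s.io/kubectl/pkg/drain"
)

// drainLikeCAPI approximates the machine controller's drain configuration:
// DaemonSet pods are always skipped, so they keep running after the drain
// and their volumes stay mounted, which means the later "wait for volume
// detach" step can never complete.
func drainLikeCAPI(ctx context.Context, kubeconfig, nodeName string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	drainer := &kubedrain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true, // the hard-coded value linked above
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	return kubedrain.RunNodeDrain(drainer, nodeName)
}
```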
This is intrinsic to DaemonSets: even if the property were false, the drain would just be blocked, and the pod would not terminate gracefully since it would be automatically recreated by the DaemonSet. To fully remove the Pod from that Node you could leverage labels/NodeSelectors, so https://github.com/kubernetes-sigs/cluster-api/pull/6255 might come in handy for this scenario.
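A hedged sketch of that approach, assuming the DaemonSet schedules its pods via a nodeSelector on a label we control (the label name below is made up for the example):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// removeDaemonSetLabel removes the label the DaemonSet's nodeSelector matches
// on. The DaemonSet controller then deletes its pod from that node, which
// lets the volume be unmounted and detached before the Machine is deleted.
func removeDaemonSetLabel(ctx context.Context, kubeconfig, nodeName string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Hypothetical label that the DaemonSet selects on.
	delete(node.Labels, "example.com/run-storage-ds")
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```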
What about making it configurable whether we wait for all the volumes to be detached or not? Would it be problematic to delete a node where a daemonset still has a volume attached? Once the node is deleted, the pod will be deleted and the PVC too. Would this be too risky?
I think this probably depends on the infrastructure provider. I saw some weird issues in cases like this with OpenStack, but that might be due to that specific OpenStack environment.
In the OpenStack case the CSI pod running on that node would be responsible for unmounting etc. (I don't remember the details). Not sure how safe this is in general, and how gracefully this will be handled by CSI and the infrastructure, if a node/server is just deleted with attached volumes.
I think this old issue https://github.com/kubernetes/kubernetes/issues/54368#issuecomment-339378164 could be relevant here. It is about StatefulSets, but a DaemonSet with volumes comes quite close.
From a safety perspective I guess the best thing would be to shut down the node, but then we are already in infrastructure provider territory.
Draining of DaemonSets has apparently been discussed upstream for a long time in https://github.com/kubernetes/kubernetes/issues/75482, but it has always been postponed (it was first milestoned to v1.16). On the other hand, there is https://github.com/kubernetes-sigs/cluster-api/issues/6158, which is similar to what we have been discussing here.
You can use .NodeDrainTimeout to bypass draining after a timeout; the machine controller will then proceed to delete the Node even though it still has volumes attached. That would have the pertinent implications on each particular platform.
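For reference, a sketch of where that knob lives, assuming the v1beta1 API (normally you would just set nodeDrainTimeout in the MachineDeployment YAML; the Go form is only for illustration):

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// setNodeDrainTimeout sets spec.template.spec.nodeDrainTimeout on a
// MachineDeployment. After this duration the machine controller gives up on
// the drain (and, per the code linked below, on waiting for volume detach)
// and proceeds with node deletion.
func setNodeDrainTimeout(md *clusterv1.MachineDeployment) {
	md.Spec.Template.Spec.NodeDrainTimeout = &metav1.Duration{Duration: 10 * time.Minute}
}
```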
As for what cluster autoscaler does, we could consider doing something similar, i.e. taint Nodes that are being deleted and evict DS pods manually. I'd question how strong the use case is for supporting this opinionatedly vs enabling a consumer to do it as in https://github.com/kubernetes-sigs/cluster-api/issues/6285#issuecomment-1063967369
To get more context, can you please elaborate on your use case for running a DS with a volume?
We have the node drain timeout, but CAPI is still blocking on the volumes even after the timeout has expired. Maybe that could be modified so that we don't wait for the volumes to be detached after the timeout has expired?
Regarding the use case, we don't have much info. It is a customer using these and they did not give us any specific information.
Extending the timeout to also cover the volumes would be an acceptable solution for us.
We have the node drain timeout, but CAPI is still blocking on the volumes even after the timeout has expired.
@maelk By looking at the code I'd be surprised if that's the case; can you validate/confirm and share logs? https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller.go#L309-L353
/milestone v1.2
/priority awaiting-more-evidence to hear from https://github.com/kubernetes-sigs/cluster-api/issues/6285#issuecomment-1070557608
/milestone Next
Sorry for the delay, I was on PTO. I verified the inputs we got, and actually the node drain timeout was not set. So you are right about the current behaviour. Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms. Could we maybe consider splitting the timeouts, or adding another control knob to change the timeout behaviour for the volume part only?
@enxebre @vincepri how do you like the idea of splitting the timeouts, i.e. having a general drain timeout and, within that, a timeout specifically for volumes?
/cc @sbueringer @fabriziopandini
What we were thinking could be doable is to introduce a new timeout, e.g. called volumeDrainTimeout, specifically for volumes.
Then, whenever we check whether we should wait for volumes, we would also have volumeDrainTimeout set. So in the case where nodeDrainTimeout is set to 0 (meaning wait indefinitely, which is the scenario we are dealing with) and we have a volume that can never be detached, we would wait until either nodeDrainTimeout or volumeDrainTimeout expires; whichever of the two expires first, we could then move on with the node deletion rather than being stuck waiting.
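A rough sketch of the check we have in mind (all names, including volumeDrainTimeout and the waitStarted parameter, are illustrative, not a final API proposal):

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// timeoutExpired treats a nil or zero timeout as "wait forever", matching the
// existing nodeDrainTimeout semantics.
func timeoutExpired(since time.Time, timeout *metav1.Duration) bool {
	if timeout == nil || timeout.Duration == 0 {
		return false
	}
	return time.Since(since) > timeout.Duration
}

// shouldWaitForVolumeDetach keeps waiting for volumes only while neither
// nodeDrainTimeout nor the proposed volumeDrainTimeout has expired, so a
// DaemonSet volume that can never detach no longer blocks deletion forever.
func shouldWaitForVolumeDetach(waitStarted time.Time, nodeDrainTimeout, volumeDrainTimeout *metav1.Duration) bool {
	if timeoutExpired(waitStarted, nodeDrainTimeout) {
		return false
	}
	if timeoutExpired(waitStarted, volumeDrainTimeout) {
		return false
	}
	return true
}
```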
Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms.
Can you elaborate on this so we can understand the full picture? If we consider this a generic valid use case not covered by drainTimeout, then either volumeDrainTimeout or supporting draining daemonsets, e.g. using taints, makes sense to me; otherwise this could be a valid use case for a runtime extension hook.
then either volumeDrainTimeout or supporting draining daemonsets, e.g. using taints, makes sense to me
@enxebre great, thanks! I will try to put a draft PR up soon.
Here is our use case with Rook: in order to preserve the Ceph cluster, Rook needs to tightly control the deprovisioning of the node, since there are constraints on OSDs going down: if more than one replica of the data is missing, the Ceph cluster becomes unusable until at most one replica of the data is gone again. So to go through the upgrade, it uses PDBs to prevent more than x nodes from going offline at once for reprovisioning. It will prevent the draining of any other node as long as the Ceph cluster is not ready to have more OSDs going down. So until the data is rebalanced and there is room to take another OSD down, the drain of the next node will be blocked. We do not want to override this mechanism with a drain timeout since it could break our Ceph cluster. However, we have a daemonset that mounts a volume, and currently that blocks the deprovisioning of the node without manual action. If we could set a timeout on that specific part of the deprovisioning without impacting the actual draining, then our problem would be solved. Do you see another alternative?
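For illustration, this is the kind of PodDisruptionBudget Rook maintains for the OSDs (the names and labels below are approximate, not copied from a real cluster); it is what blocks a second node drain until the Ceph cluster is healthy again:

```go
package main

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// rookLikeOSDPDB allows at most one OSD pod to be disrupted at a time, so a
// drain of a second node stays blocked while the first node's OSDs are down.
func rookLikeOSDPDB() *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "rook-ceph-osd",
			Namespace: "rook-ceph",
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "rook-ceph-osd"},
			},
		},
	}
}
```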
Sorry for getting to this issue late. I'm not 100% sure I got the difference between forcing deletion due to nodeDrainTimeout expiring or due to a new volumeDrainTimeout, because if I understand the above comments right, in both cases we are going to delete the node with attached volumes, or am I wrong?
Also, a couple of comments about API changes if we go down this path:
- if we are going to add a new timeout, it should go into ClusterTopology as well
- similarly, it should go in KCP/CP CRs, but this requires a contract update
- if I look at the machine API we are adding timeouts to spec without a proper organization; wondering if we can do a better job at modeling delete options
/cc @yastij @vincepri
in both cases we are going to delete the node with attached volumes, or am I wrong?
@fabriziopandini correct.
- if we are going to add a new timeout, it should go into ClusterTopology as well
I will look into adding it to #6413, which is in flight now
similarly, it should go in KCP/CP CRs, but this requires a contract update
The first part is covered, I believe; would you mind elaborating more on the contract update part?
- if I look at the machine API we are adding timeouts to spec without a proper organization; wondering if we can do a better job at modeling delete options
:+1:
/assign
/lifecycle active
@furkatgofurov7 if in both cases (existing nodeDrainTimeout and proposed volumeDrainTimeout) we are going to delete the node with attached volumes, why can't we use the existing nodeDrainTimeout?
Also, before proceeding, I would like to hear other opinions about the resulting API
WRT the API contract, the relevant part is documented here; changing this requires community agreement.
@furkatgofurov7 if in both cases (existing nodeDrainTimeout and proposed volumeDrainTimeout) we are going to delete the node with attached volumes, why can't we use the existing nodeDrainTimeout?
@fabriziopandini what if nodeDrainTimeout is set to default value (0s) or not set? In that case, we reach this condition, where later we would be waiting indefinitely for volumes to be detached in the case of daemon sets.
Also, before proceeding, I would like to hear other opinions about the resulting API
:100: Absolutely; sorry, I had already coded that up and opened a PR today. Maybe I should mark it as hold or WIP in that case?
WRT the API contract, the relevant part is documented here; changing this requires community agreement.
Thanks, I will wait for others opinion and proceed based on the agreement we reach :+1:
what if nodeDrainTimeout is set to default value (0s) or not set? ...
If I got this right, what we are trying to solve is deleting the node even if there are attached volumes. Looking at the code, it seems to me that this can be achieved already today by setting nodeDrainTimeout to something greater than 0; when the timeout is reached the controller is going to stop waiting for volumes and the node deletion continues.
it seems to me that this can be achieved already today by setting nodeDrainTimeout to something greater than 0
Correct, but as we discussed a bit earlier we have a use case where:
Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms.
and for more info on Rook deployment, please check this comment: https://github.com/kubernetes-sigs/cluster-api/issues/6285#issuecomment-1092466360
Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms
That's probably the missing bit for me: how can the CAPI nodeDrainTimeout override the Rook internal control mechanism?
The issue with a timeout on the drain is that it renders the PDB settings done by Rook ineffective. Let's say we are doing an upgrade on a MachineDeployment where Rook is deployed. Rook will set the PDB to allow only 1 node to go away, so that we keep enough OSDs (disks on the nodes) running to keep the Ceph cluster usable. So the draining of a first node will succeed. However, if a second node is to be drained by CAPI, it should block until the first node is back and the PDB is respected; otherwise the Ceph cluster will become unhealthy because too many OSDs are gone. So we cannot set a drain timeout, because it might cause the second node to be deleted, and that would cause too many OSDs, or the wrong OSD, to be gone. However, once the first node comes back, the OSDs there are available again, and the cluster is healthy again, the draining will succeed. But in our case, a volume will block the deletion after draining succeeds.
The result is that we cannot set a drain timeout, otherwise we might break the Ceph cluster, but we need a timeout in case one of the volumes does not get detached, independently of the draining operation.
Does this make our use case clear?
Can you elaborate on this so we can understand the full picture? If we consider this a generic valid use case not covered by drainTimeout, then either volumeDrainTimeout or supporting draining daemonsets, e.g. using taints, makes sense to me; otherwise this could be a valid use case for a runtime extension hook.
By going through https://github.com/kubernetes-sigs/cluster-api/issues/6285#issuecomment-1092466360, am I right to say that your problem would be automatically solved if the pod holding the volume were not owned by a daemon set, so it would be naturally evicted like everything else, letting the volume go away? It seems to me that the problem, and the generic use case to be discussed and solved, is evicting DaemonSets rather than an additional timeout for volumes.