helm: can't upgrade to 0.15.0 in place due to daemonset label selector change

Open mrparkers opened this issue 1 year ago • 3 comments

The newest version of the k8s-device-plugin chart seems to have removed support for specifying label selectors for each daemonset. The selectors can no longer be changed through helm values, and because spec.selector is immutable it cannot be changed via the k8s API either, which makes an in-place upgrade to v0.15.0 via helm very difficult.

You can use helm template along with yq to observe this change. If you have both of these tools installed, use this one-liner to observe the label selectors for v0.14.5:

helm template nvidia-device-plugin nvdp/nvidia-device-plugin --version 0.14.5 --set gfd.enabled=true | yq e 'select(.kind == "DaemonSet") | select(.metadata.name == "nvidia-device-plugin-gpu-feature-discovery") | .spec.selector.matchLabels'

This results in these label selectors:

app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/instance: nvidia-device-plugin

These are the default label selectors for GFD, but they can be changed via gfd.nameOverride in the values.

However, in v0.15.0, the default label selectors have changed, and there is no way to use helm values to change them back to what they were before:

helm template nvidia-device-plugin nvdp/nvidia-device-plugin --version 0.15.0 --set gfd.enabled=true | yq e 'select(.kind == "DaemonSet") | select(.metadata.name == "nvidia-device-plugin-gpu-feature-discovery") | .spec.selector.matchLabels'

This results in these label selectors:

app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/instance: nvidia-device-plugin
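
As a minimal sketch, the two outputs can be compared mechanically to check whether an upgrade would touch the immutable selector. The label values below are copied from the outputs in this issue; the helper name selectors_match is hypothetical and not part of helm, yq, or the chart.

```shell
#!/bin/sh
# Selector labels as reported in this issue for each chart version.
old_selector='app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/instance: nvidia-device-plugin'
new_selector='app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/instance: nvidia-device-plugin'

# selectors_match (hypothetical helper): exits 0 only when both label sets
# are identical after sorting, meaning an upgrade would leave
# spec.selector untouched.
selectors_match() {
  [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}

if selectors_match "$old_selector" "$new_selector"; then
  echo "selectors unchanged: an in-place upgrade should not touch spec.selector"
else
  echo "selectors differ: expect the upgrade to fail with 'field is immutable'"
fi
```

In practice the two strings would be captured from the helm template and yq commands above rather than hard-coded.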

Because these label selectors cannot be changed in v0.15.0 by any helm value, any attempt at an upgrade results in an error that looks like this:

Helm upgrade failed for release kube-system/nvidia-device-plugin with chart nvidia-device-plugin@0.15.0: cannot patch "nvidia-device-plugin-gpu-feature-discovery" with kind DaemonSet: DaemonSet.apps "nvidia-device-plugin-gpu-feature-discovery" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"nvidia-device-plugin", "app.kubernetes.io/name":"nvidia-device-plugin"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

Is this a bug in v0.15.0 of the chart, or am I missing some other way to change these label selectors?

mrparkers avatar May 08 '24 20:05 mrparkers

Also, I'm happy to submit a PR to fix this, if this is indeed a bug.

mrparkers avatar May 09 '24 21:05 mrparkers

In the last release, we merged the code from GFD into the device plugin repo and deprecated the gpu-feature-discovery repo itself. I believe the change in values is likely an oversight that occurred as part of this merge (or if done on purpose, the implications of it weren't obvious at the time).

/cc @ArangoGutierrez and @elezar for their thoughts on what to do here

klueska avatar May 09 '24 21:05 klueska

@mrparkers As @klueska points out, this is a side effect of the migration and was not intentional. It should be considered a bug, especially if it is preventing in-place upgrades.

If you're willing to submit a patch that would address this, that would be great. Please open a PR so that @ArangoGutierrez and I can review.

elezar avatar May 10 '24 08:05 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 09 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 08 '24 04:09 github-actions[bot]