Helm release unable to roll back manual changes after enabling drift detection
Hello,
I am currently using Flux version v0.41.2 (helm-controller version v0.31.2). I have enabled --feature-gates=DetectDrift=true on the helm-controller so that manual changes to Helm-managed deployments are automatically rolled back. After enabling this feature I see the following error on our Kyverno HelmRelease:
$ flux get hr | grep -i kyverno
kyverno    2.5.5+2dbac08dfae2.2    False    False    failed to diff release against cluster resources: Deployment/kyverno/kyverno-kyverno dry-run failed, reason: Invalid, error: Deployment.apps "kyverno-kyverno" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "500Mi": must be less than or equal to memory limit
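For context, drift detection is gated behind a helm-controller feature flag. A minimal sketch of enabling it via a kustomize patch on the flux-system components is below; the patch target and file layout are assumptions about a typical bootstrap setup, not taken from this report:

# flux-system/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      # append the feature gate to the controller's arguments
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DetectDrift=true

The HelmRelease for the affected release is: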
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  serviceAccountName: helm-controller
  storageNamespace: kyverno
  targetNamespace: kyverno
  interval: 3m
  install:
    createNamespace: false
  chart:
    spec:
      reconcileStrategy: Revision
      chart: ./helm/kyverno/v1.7.5
      sourceRef:
        kind: GitRepository
        name: kyverno-infra-kube-manifest
        namespace: flux-system
      interval: 1m
      valuesFiles:
        - ./envs/<name>/stage/stage-white/helm/kyverno/values-v1.7.5.yaml
In the HelmChart, the values set the kyverno deployment memory limit to 200Mi and the request to 100Mi. I manually edited the deployment inside the cluster and increased the request to 500Mi and the limit to 1000Mi. After this change Helm is unable to revert it automatically and shows the error above. I believe the error occurs because the actual memory request present in the cluster (500Mi) is greater than the memory limit in the HelmChart. If I make the request in the cluster less than 200Mi, the error disappears and the rollback happens automatically. This is a problem, as these manual changes can no longer be reverted by Helm drift detection. Is there any solution to this problem? cc: @hiddeco
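For anyone reproducing this, the manual edit described above can be made and verified with something like the following (a sketch; the original edit may have been done with kubectl edit, and the deployment name kyverno-kyverno is taken from the error output above):

# raise the cluster values past the chart's 200Mi limit
kubectl -n kyverno set resources deployment/kyverno-kyverno \
  --requests=memory=500Mi --limits=memory=1000Mi
# confirm what is actually running on the cluster
kubectl -n kyverno get deployment kyverno-kyverno \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'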
I am trying to reproduce this, @sojinss4u. I have installed the kyverno Helm chart with the following manifest:
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  targetNamespace: kyverno
  releaseName: kyverno
  interval: 30m
  install:
    createNamespace: true
  chart:
    spec:
      chart: kyverno
      version: 2.6.0
      sourceRef:
        kind: HelmRepository
        name: kyverno
        namespace: flux-system
      interval: 30m
Then I edited the kyverno deployment in the cluster to use a memory limit of 1000Mi and a request of 500Mi.
Deployment/kyverno/kyverno diff:
.metadata.generation
-2
+3
.spec.template.spec.containers[0].resources.limits.memory
-1000Mi
+384Mi
.spec.template.spec.containers[0].resources.requests.memory
-500Mi
+128Mi
Are there any steps I might be missing? It's pretty weird that the value on the cluster is causing an error during the dry run, since the manifest used for the dry run is taken from the previous release.
The steps are correct. Basically, this error comes up when the manually set limit on the cluster is greater than the limit in the last release.
@sojinss4u I am unable to replicate your error then. The helm controller reverts the changes for me.
What memory requests/limits do you see when you run helm get manifest <release-name> --revision <last-revision> --namespace flux-system? You can get the last revision by looking at .status.lastReleaseRevision in the HelmRelease resource.
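Concretely, that check would look something like this (a sketch; the release name and revision are placeholders, and --namespace should match the release's storage namespace — the original HelmRelease sets storageNamespace: kyverno, so the Helm release history may live there rather than in flux-system):

# last deployed revision recorded on the HelmRelease
kubectl -n flux-system get helmrelease kyverno \
  -o jsonpath='{.status.lastReleaseRevision}'
# inspect the resources block in the manifest stored for that revision
helm get manifest kyverno --revision <last-revision> \
  --namespace flux-system | grep -A6 'resources:'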
I've run into a similar issue; it seems this happens when not using the replacement strategy (the relevant HelmRelease field is sketched below):
With upgrade.force = true, changes are reverted as expected.
With upgrade.force = false, drift detection enters a loop: on every iteration of spec.interval it detects the changes and issues a patch, the patch apparently fails silently, yet the upgrade is logged as successful and no changes are actually applied.
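For reference, the toggle being discussed corresponds to Helm's --force (replacement) behaviour and lives under spec.upgrade in the HelmRelease (a minimal excerpt, not taken from the reporter's manifest):

spec:
  upgrade:
    force: true  # force resource updates through a replacement strategy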
I've run into another caveat with upgrade.force=true; I'll open a separate issue.
FYI, I only have metadata being logged on the API servers and not the full payload (logging full payloads would take an immense toll on our clusters' logs), but I can see that the patch request is being issued and returns a 200, while the spec of my object isn't actually changed, so most likely the issue is with the patch payload itself. If I find any further info I'll drop it here later :)
I've built a custom version of the latest helm-controller with a custom build of Helm 3.12.3 that logs the patches it is about to send to the API server; here are the relevant bits:
helm-controller detects drift on .spec.pod.resources.requests.cpu and triggers an upgrade:
{"level":"debug","ts":"2023-11-09T13:55:38.012Z","msg":"EtcdCluster/etcd/etcd-cluster diff:\n.metadata.generation\n-115\n+116\n\n.spec.pod.resources.requests.cpu\n-1801m\n+1800m","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}
helm upgrade sends this patch to the API server:
{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch EtcdCluster \"etcd-cluster\" in namespace etcd","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}
{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch: {\"metadata\":{\"annotations\":{\"meta.helm.sh/release-name\":\"pitaya-cluster\",\"meta.helm.sh/release-namespace\":\"etcd\"}}} ","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}
Just to add: from my testing this seems to affect the patching of custom resources (like the EtcdCluster above) but not native resources like Deployment. Seems like a Helm bug to me.
Upon closer analysis, it seems this is unrelated to the original issue; I've opened https://github.com/fluxcd/helm-controller/issues/805 to address this problem specifically.