
Helm release unable to rollback manual changes after enabling drift detection

Open sojinss4u opened this issue 2 years ago • 8 comments

Hello,

I am currently using Flux version 'v0.41.2' (Helm controller version: v0.31.2). I have enabled '--feature-gates=DetectDrift=true' in the Helm controller so that manual changes to Helm deployments can be automatically rolled back. After enabling this feature, I see the following error in our Kyverno Helm release.
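
For reference, the feature gate can be enabled by patching the helm-controller Deployment args from the flux-system Kustomization. This is only a sketch assuming the standard Flux bootstrap layout (gotk-components.yaml / gotk-sync.yaml); adjust it to your repository structure:

# flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DetectDrift=true
    target:
      kind: Deployment
      name: helm-controller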

$ flux get hr | grep -i kyverno                                                               

kyverno                         2.5.5+2dbac08dfae2.2    False           False   failed to diff release against cluster resources: Deployment/kyverno/kyverno-kyverno dry-run failed, reason: Invalid, error: Deployment.apps "kyverno-kyverno" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "500Mi": must be less than or equal to memory limit
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  serviceAccountName: helm-controller
  storageNamespace: kyverno
  targetNamespace: kyverno
  interval: 3m
  install:
    createNamespace: false
  chart:
    spec:
      reconcileStrategy: Revision
      chart: ./helm/kyverno/v1.7.5
      sourceRef:
        kind: GitRepository
        name: kyverno-infra-kube-manifest
        namespace: flux-system
      interval: 1m
      valuesFiles:
        - ./envs/<name>/stage/stage-white/helm/kyverno/values-v1.7.5.yaml

In the HelmChart, the memory limit for the kyverno deployment is 200Mi and the request is 100Mi. I manually edited the deployment inside the cluster and increased the request to 500Mi and the limit to 1000Mi. After the change, Helm is unable to automatically revert it and shows the error above. I believe this error occurs because the actual memory request present in the cluster (500Mi) is greater than the memory limit present in the HelmChart (200Mi). If I make the request in the cluster less than 200Mi, the error disappears and the rollback happens automatically. This is a problem, since such manual changes can't be reverted by Helm drift detection. Is there any solution to this problem? cc: @hiddeco
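
To illustrate the mismatch with concrete values (these are the numbers described above, in the usual Kubernetes resources layout):

# From the HelmChart values (last release):
resources:
  requests:
    memory: 100Mi
  limits:
    memory: 200Mi

# After my manual edit on the cluster:
resources:
  requests:
    memory: 500Mi   # exceeds the 200Mi limit from the chart, so the dry-run is rejected
  limits:
    memory: 1000Mi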

sojinss4u commented Jul 22 '23 17:07

I am trying to reproduce this, @sojinss4u. I have installed the kyverno Helm chart with the following manifest:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  targetNamespace: kyverno
  releaseName: kyverno
  interval: 30m
  install:
    createNamespace: true
  chart:
    spec:
      chart: kyverno
      version: 2.6.0
      sourceRef:
        kind: HelmRepository
        name: kyverno
        namespace: flux-system
      interval: 30m

Then I edited the kyverno deployment in the cluster to use a memory limit of 1000Mi and a request of 500Mi.
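
For reference, the manual edit can be reproduced with something along these lines (a sketch; deployment and namespace names as in my setup above):

kubectl -n kyverno set resources deployment/kyverno --requests=memory=500Mi --limits=memory=1000Mi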

Deployment/kyverno/kyverno diff:
.metadata.generation
-2
+3

.spec.template.spec.containers[0].resources.limits.memory
-1000Mi
+384Mi

.spec.template.spec.containers[0].resources.requests.memory
-500Mi
+128Mi

Are there any steps I might be missing? It's pretty weird that the value on the cluster causes an error during the dry run, since the manifest used for the dry run comes from the previous release.

somtochiama commented Jul 31 '23 12:07

The steps are correct. Basically, this error occurs when the memory request manually set on the cluster is greater than the memory limit from the last release.

sojinss4u commented Aug 01 '23 02:08

@sojinss4u I am unable to replicate your error then. The helm controller reverts the changes for me.

What memory requests/limits do you see when you run helm get manifest <release-name> --revision <last-revision> --namespace flux-system? You can get the last revision by looking at .status.lastReleaseRevision in the HelmRelease resource.
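
For example, assuming the release name kyverno as above (and that --namespace matches the storage namespace of the release):

REV=$(kubectl -n flux-system get helmrelease kyverno -o jsonpath='{.status.lastReleaseRevision}')
helm get manifest kyverno --revision "$REV" --namespace flux-system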

somtochiama commented Aug 01 '23 16:08

I've run into a similar issue; it seems this happens when not using the replacement strategy:

With upgrade.force = true, changes are reverted as expected.

With upgrade.force = false, drift detection enters a loop: on every iteration of spec.interval it detects the changes and issues a patch; the patch apparently fails silently, yet the upgrade is logged as successful and no changes are actually applied.
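
For context, the force behaviour above is toggled on the HelmRelease spec itself; a minimal sketch with placeholder names:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example        # placeholder
  namespace: flux-system
spec:
  # ...
  upgrade:
    force: true   # revert drift via Helm's replacement strategy instead of a patch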

I've run into another caveat with upgrade.force=true; I'll open a separate issue.

rsafonseca commented Nov 09 '23 11:11

FYI, I only have metadata being logged on the API servers and not the full payload (that would put an immense toll on the logs in our clusters), but I can see that the patch request is issued and returns a 200 while the spec of my object isn't actually changed, so most likely the issue is with the patch payload itself. If I find any further info I'll drop it here later :)

rsafonseca commented Nov 09 '23 11:11

I've built a custom version of the latest helm-controller with a custom build of Helm 3.12.3 that outputs the patches it's going to send to the API server. Here are the relevant bits:

helm-controller detects drift on .spec.pod.resources.requests.cpu and triggers an upgrade:

{"level":"debug","ts":"2023-11-09T13:55:38.012Z","msg":"EtcdCluster/etcd/etcd-cluster diff:\n.metadata.generation\n-115\n+116\n\n.spec.pod.resources.requests.cpu\n-1801m\n+1800m","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}

helm upgrade sends this patch to the API server:

{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch EtcdCluster \"etcd-cluster\" in namespace etcd","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}
{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch: {\"metadata\":{\"annotations\":{\"meta.helm.sh/release-name\":\"pitaya-cluster\",\"meta.helm.sh/release-namespace\":\"etcd\"}}} ","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}

rsafonseca commented Nov 09 '23 14:11

Just to add: from my testing this seems to affect the patching of custom (CRD-defined) resources but not native resources like Deployment. Seems like a Helm bug to me...

rsafonseca commented Nov 09 '23 14:11

Upon closer analysis, it seems this is unrelated to the original issue; I've opened https://github.com/fluxcd/helm-controller/issues/805 to address this problem specifically.

rsafonseca commented Nov 09 '23 17:11