
Allow custom timeout for HelmChart

Open jawabuu opened this issue 4 years ago • 37 comments

In helm it is possible to pass a custom timeout, e.g. helm upgrade --install --debug --timeout 1200s ... Is this option supported in HelmChart? If not, can it be included? @ibuildthecloud @erikwilson
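For illustration, this is the kind of invocation I mean (the release, chart, and namespace names below are placeholders):

# plain helm CLI with a 20-minute timeout instead of the 300s default
# (release/chart/namespace names are placeholders)
$ helm upgrade --install my-release my-repo/my-chart \
    --namespace my-namespace \
    --debug --timeout 1200s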

jawabuu avatar Apr 07 '21 08:04 jawabuu

Ping @erikwilson @ibuildthecloud any update? The default timeout for helm is 300s. For setups with minimal resources and using k3s, this will cause the created resources to be restarted and thrash the system. Charts with long-running jobs, e.g. migrations, are most affected by this.

jawabuu avatar Apr 24 '21 10:04 jawabuu

Ping @galal-hussein Alternatively, could we default to helm upgrade --install instead of helm install?

jawabuu avatar Jun 01 '21 17:06 jawabuu

@jawabuu currently passing a timeout to the HelmChart installation is not supported in the HelmChart spec https://github.com/k3s-io/helm-controller/blob/master/pkg/apis/helm.cattle.io/v1/types.go#L20-L29, and I don't think there is a plan for adding it to the spec in the near future.
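For reference, a HelmChart today looks roughly like the sketch below (field names per the linked types.go; the chart, repo, and values are placeholders, and note that there is no timeout field):

# sketch of a HelmChart as the spec stands today (no timeout field);
# chart name, repo and values are placeholders
$ kubectl apply -f - <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: example-chart
  namespace: kube-system
spec:
  chart: example-chart
  repo: https://charts.example.com
  targetNamespace: example
  valuesContent: |-
    replicaCount: 1
EOF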

Can you elaborate more on the issue: does installation take more than 5 minutes in your setup?

galal-hussein avatar Jun 01 '21 17:06 galal-hussein

Hey @galal-hussein, thank you for getting back to me. Installation does take more than 5 minutes; the Superset and Sentry charts are easily reproducible examples.

jawabuu avatar Jun 01 '21 18:06 jawabuu

@galal-hussein Once the installation fails, reruns fail with the error cannot re-use a name that is still in use and require manual cleanup. If a timeout is not supported, is there any objection to using helm upgrade --install?

jawabuu avatar Jun 01 '21 18:06 jawabuu

@jawabuu we only do helm upgrade --install when the chart is already deployed https://github.com/k3s-io/klipper-helm/blob/master/entry#L39, so if the chart is not really installed and timed out, it will attempt to re-install it again.
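In rough pseudocode, the decision looks something like this (a simplified sketch, not the literal entry script; the names are placeholders):

# simplified sketch of the install-vs-upgrade decision (not the actual entry script)
NAME=example-chart                # placeholder release/chart name
NAMESPACE=kube-system             # placeholder namespace
REPO=https://charts.example.com   # placeholder repo
STATUS=$(helm ls -f "^${NAME}\$" --namespace "${NAMESPACE}" --output json | jq -r '.[0].status')
if [ "${STATUS}" = "deployed" ]; then
  # a healthy release record exists: upgrade in place
  helm upgrade --namespace "${NAMESPACE}" --repo "${REPO}" "${NAME}" "${NAME}"
else
  # no visible release record (fresh chart or timed-out attempt):
  # fall back to a plain install, which fails if a stale record exists
  helm install --namespace "${NAMESPACE}" --repo "${REPO}" "${NAME}" "${NAME}"
fi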

galal-hussein avatar Jun 02 '21 15:06 galal-hussein

It could be a good idea to add extra args for the helm command to the chart spec. cc @brandond

galal-hussein avatar Jun 02 '21 15:06 galal-hussein

if we do that we'd probably need to be pretty flexible about it, and allow for specific install/upgrade/delete args?

The fastest way to solve this would probably be to build your own klipper-helm image that specifies the timeout you want, and set that as jobImage in the spec.

brandond avatar Jun 02 '21 16:06 brandond

@galal-hussein This only works if the chart is in status deployed. Otherwise it returns cannot re-use a name that is still in use

jawabuu avatar Jun 02 '21 17:06 jawabuu

  1. Checkout https://github.com/k3s-io/klipper-helm
  2. Modify https://github.com/k3s-io/klipper-helm/blob/master/entry#L31 by adding --timeout=x at the end
  3. Run make
  4. Tag and push the resulting image to a registry
  5. Update jobImage in your spec to point at your custom image (see the sketch below)
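The resulting HelmChart would then point at the custom image via jobImage, roughly like this (the image reference, chart, and repo are placeholders):

# sketch: HelmChart using a custom klipper-helm build via jobImage
# (image reference, chart and repo are placeholders)
$ kubectl apply -f - <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: example-chart
  namespace: kube-system
spec:
  chart: example-chart
  repo: https://charts.example.com
  jobImage: registry.example.com/my-org/klipper-helm:custom-timeout
EOF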

Even if we do modify helm-controller, it likely would not make it into k3s for several weeks until we do our next patch release.

brandond avatar Jun 02 '21 17:06 brandond

Thanks @brandond. This will be helpful for me. Whenever a release gets into an indeterminate state with the error cannot re-use a name that is still in use, one can clear it by either deleting or patching the secret, in this format: kubectl patch secret/sh.helm.release.v1.traefik.v2 --type=merge -p '{"metadata":{"labels":{"status":"deployed"}}}', after which installation succeeds. Alternatively, the operation helm upgrade --install .. is idempotent and takes care of these edge cases. My request is that you consider using it if you determine there are no downsides. Also, I would patiently wait for that patch release :-)

jawabuu avatar Jun 02 '21 19:06 jawabuu

@brandond, just to zero in on the actual issue: klipper-helm does not handle the pending-* statuses in the helm_update() function. The pending-upgrade status can result if a timeout occurs, and subsequent installs will then fail without 'manual' intervention or helm upgrade --install.
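Something along these lines is what I have in mind (a rough sketch only; treating pending-* as a stale lock and falling back to helm upgrade --install is one possible recovery, not necessarily the right one):

# sketch of handling the currently-unhandled pending-* statuses (illustrative only)
NAME=example-chart                # placeholder release/chart name
NAMESPACE=kube-system             # placeholder namespace
REPO=https://charts.example.com   # placeholder repo
# note the -a flag so that pending-* releases are actually visible to the filter
STATUS=$(helm ls -a -f "^${NAME}\$" --namespace "${NAMESPACE}" --output json | jq -r '.[0].status')
case "${STATUS}" in
  deployed)
    helm upgrade --namespace "${NAMESPACE}" --repo "${REPO}" "${NAME}" "${NAME}" ;;
  pending-install|pending-upgrade|pending-rollback)
    # an interrupted operation left a stale release record behind
    helm upgrade --install --namespace "${NAMESPACE}" --repo "${REPO}" "${NAME}" "${NAME}" ;;
  *)
    helm install --namespace "${NAMESPACE}" --repo "${REPO}" "${NAME}" "${NAME}" ;;
esac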

jawabuu avatar Jun 02 '21 19:06 jawabuu

A PR would be appreciated!

brandond avatar Jun 02 '21 20:06 brandond

Update: this may be the root cause of some issues we're seeing with nginx upgrade failures in RKE2, so I'll be taking a look at it tomorrow.

brandond avatar Jun 03 '21 00:06 brandond

A PR would be appreciated!

@brandond sure. I'll let you complete your findings first. For reference https://github.com/helm/helm/pull/7653

https://github.com/helm/helm/issues/5595#issuecomment-684312311

https://github.com/helm/helm/issues/5513

kubectl -n $namespace delete secret -lstatus=pending-upgrade

jawabuu avatar Jun 03 '21 03:06 jawabuu

These are the available statuses

https://github.com/helm/helm/blob/bf486a25cdc12017c7dac74d1582a8a16acd37ea/cmd/helm/status.go#L41 - state of the release (can be: unknown, deployed, uninstalled, superseded, failed, uninstalling, pending-install, pending-upgrade or pending-rollback)

jawabuu avatar Jun 03 '21 03:06 jawabuu

@brandond @galal-hussein Here is a complete example

  1. Query the HelmChart pod with the error:

$ kubectl get po -n kube-system
NAME                                      READY   STATUS             RESTARTS   AGE
coredns-7448499f4d-ldbnb                  1/1     Running            0          46h
helm-install-pomerium-khnr5               0/1     CrashLoopBackOff   29         30m
  2. Check the logs:

$ kubectl logs -f -n kube-system helm-install-pomerium-khnr5
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' true '!=' true ']'
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/pomerium.tgz.base64
+ CHART_PATH=/pomerium.tgz
+ '[' '!' -f /chart/pomerium.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
+ '[' helm_v3 == helm_v3 ']'
+ [[ pomerium == stable/* ]]
+ '[' -n https://helm.pomerium.io ']'
+ helm_v3 repo add pomerium https://helm.pomerium.io
"pomerium" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "pomerium" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace kube-system --repo https://helm.pomerium.io
+ '[' helm_v3 == helm_v3 ']'
++ helm_v3 ls -f '^pomerium$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=null,null
++ cut -f1 -d,
++ echo null,null
+ INSTALLED_VERSION=null
++ cut -f2 -d,
++ echo null,null
+ STATUS=null
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ '[' install = delete ']'
+ '[' -z null ']'
+ '[' null = deployed ']'
+ '[' null = failed ']'
+ '[' null = deleted ']'
+ helm_v3 install --namespace kube-system --repo https://helm.pomerium.io pomerium pomerium --values /config/values-01_HelmChart.yaml
Error: cannot re-use a name that is still in use
  3. Listing does not find the chart:

$ helm ls -f '^pomerium$' --namespace kube-system
NAME    NAMESPACE       REVISION        UPDATED STATUS  CHART   APP VERSION
  4. Listing with the -a flag does find the chart, with status pending-upgrade:

$ helm ls -f '^pomerium$' --namespace kube-system -a

NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
pomerium        kube-system     4               2021-06-06 18:28:56.738300836 +0000 UTC pending-upgrade pomerium-20.0.0 0.14.4

In the end, the issue is that the helm ls -f filter (without -a) does not return releases in pending states, so the script falls back to installing over an existing release, which will obviously fail.

jawabuu avatar Jun 06 '21 19:06 jawabuu

Yes, the fix will handle that properly, as well as allow for extending the timeout.

brandond avatar Jun 06 '21 19:06 brandond

@brandond Great!!

jawabuu avatar Jun 06 '21 20:06 jawabuu

This is landing in RKE2 1.21; no ETA on when we will pull it in to K3s or older RKE2 releases.

brandond avatar Jun 09 '21 00:06 brandond

Thanks @brandond!! Any idea how we can make use of it in older k3s releases considering its importance?

jawabuu avatar Jun 09 '21 07:06 jawabuu

As before, you can specify jobImage in your HelmChart to at least take advantage of the fixes for chart status handling. Using the timeout field on the spec will require an update to the helm controller built into k3s or rke2, so you'll need to wait for a release for that.

brandond avatar Jun 09 '21 07:06 brandond

Got it. Thanks again. A question: what is the default command you run when you encounter a $STATUS that is not explicitly handled?

jawabuu avatar Jun 09 '21 07:06 jawabuu

@brandond the error still happens with pending-install. You should upgrade in this case.

jawabuu avatar Jun 09 '21 16:06 jawabuu

@jawabuu so pending-install can't be resolved by installing? That's unexpected. I was unable to get a chart stuck in this state, so I couldn't test. Is upgrading in this case always the correct behavior?

I reached out to @mattfarina about this to see what he suggests.

brandond avatar Jun 09 '21 17:06 brandond

@brandond

  1. You can get a chart stuck in that state by specifying an incorrect imagePullSecrets
  2. I believe the correct fallback behaviour if not specifically handled should be helm upgrade --install

jawabuu avatar Jun 09 '21 17:06 jawabuu

The "pending" states are supposed to indicate that there is another Helm process altering the chart; but in this case the process has been interrupted so the lock is stale.

  1. helm install will fail because there's already a chart with this name present. (the current problem we're running into)
  2. helm upgrade --install will not perform an install because there's already a release with this name.
  3. helm upgrade --install will not perform an upgrade, because the previous release for this name is in a pending state.

It looks to me like the only thing to do in case of any pending state is to uninstall and reinstall.

It was suggested that we could just patch the secret for the most recent release to mark it failed, although this doesn't seem to actually work since the status is also stored inside the secret's release data in the info.status field.
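For illustration, the status Helm actually consults can be inspected like this (a read-only sketch; the secret name is a placeholder for the latest revision, and it assumes the usual Helm v3 storage layout, where the release record is gzip-compressed and base64-encoded inside the secret's release field):

# read-only sketch: where info.status actually lives
# (secret name/namespace are placeholders for the latest release revision)
$ kubectl get secret sh.helm.release.v1.example-chart.v4 -n kube-system \
    -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip | jq -r '.info.status'
# prints e.g. pending-upgrade; patching only the secret's labels leaves this value untouched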

@mattfarina suggested writing a small Go utility to grab the release secret and update the status to "failed", but this would require some new development work.

brandond avatar Jun 09 '21 18:06 brandond

@brandond You are correct. I also think increasing the timeout will help in some of these situations (once it hits k3s). What are your thoughts on having a more graceful operation when such pending statuses are encountered? Maybe helm upgrade --install and a log output with suggestions, instead of a CrashLoopBackOff situation.

jawabuu avatar Jun 09 '21 19:06 jawabuu

@brandond An interesting workaround I've discovered:

  1. All installs currently time out at 5m (the helm default).
  2. At this point klipper-helm retries, and one of two things happens: a) the release is in pending-install, which is not handled, and klipper-helm errors out; b) the release is marked failed and klipper-helm proceeds to uninstall and reinstall, which is undesirable because all the pods might be up and only a migration job is still running. Deleting will keep this release in an infinite loop.

What I have done is delete the helm secret for the release created by klipper-helm before 5 minutes elapse, for example: kubectl delete secret/sh.helm.release.v1.superset.v1 -n superset. I do this every time the secret is created, and once the post-install job completes, klipper-helm completes successfully. I have tested this for the superset and sentry charts, both of which have migrations that take well over 5 minutes on a 1cpu/2gb instance. I believe an approach for pending-install would be to delete the helm secret.
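Roughly automated, the workaround looks like the loop below (a sketch only; the secret name, namespace, and sleep interval are assumptions, and the loop should be stopped once the chart's post-install jobs have finished):

# sketch of the manual workaround: keep clearing the release record so a
# timed-out attempt never leaves a failed/pending release behind
# (secret name and namespace are placeholders)
while true; do
  kubectl delete secret sh.helm.release.v1.superset.v1 -n superset --ignore-not-found
  sleep 120
done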

jawabuu avatar Jun 09 '21 21:06 jawabuu

We're planning to write a tool to move the pending releases into failed state, and then trigger an upgrade or reinstall of the chart. RKE2 currently runs into this most frequently when upgrades are interrupted, due to issues with how we schedule the job pods. It's not that the install is timing out, it's that we're killing the apiserver to upgrade it, and helm can't finish the task nor can it talk to the apiserver to mark the release as failed.
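Until that tool exists, a rough shell approximation of the recovery flow might look like the sketch below (assuming it is acceptable to simply discard the stale pending release record rather than rewrite its info.status to failed; the release name, namespace, and repo are placeholders):

# rough approximation of the planned recovery (placeholders throughout):
# drop the stale pending release record, then retry idempotently
$ kubectl delete secret -n kube-system -l owner=helm,name=example-chart,status=pending-upgrade
$ helm upgrade --install --namespace kube-system --repo https://charts.example.com example-chart example-chart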

See https://github.com/rancher/rke2/issues/1124 for a discussion of the larger issue.

brandond avatar Jun 09 '21 22:06 brandond