Allow custom timeout for HelmChart
In Helm it is possible to pass a custom timeout, like below:
helm upgrade --install --debug --timeout 1200s ..
Is this option supported in HelmChart?
If not, can it be included?
@ibuildthecloud @erikwilson
Ping @erikwilson @ibuildthecloud, any update? The default timeout for helm is 300s. For setups with minimal resources running k3s, this causes the created resources to be restarted and thrashes the system. Charts with long-running jobs, e.g. migrations, are most affected by this.
Ping @galal-hussein
Alternatively, could we default to helm upgrade --install instead of helm install?
@jawabuu currently, passing a timeout to the HelmChart installation is not supported in the HelmChart spec (https://github.com/k3s-io/helm-controller/blob/master/pkg/apis/helm.cattle.io/v1/types.go#L20-L29), and I don't think there is a plan for adding it to the spec in the near future.
Can you elaborate more on the issue? Does installation take more than 5 minutes in your setup?
Hey @galal-hussein, thank you for getting back to me.
Installation does take more than 5 minutes; the Superset and Sentry apps are examples where this is easily reproducible.
@galal-hussein Once the installation fails, it will fail to rerun with the error cannot re-use a name that is still in use and requires a manual cleanup.
If a timeout is not supported, is there any objection to using helm upgrade --install?
@jawabuu we only do helm upgrade --install when the chart is already deployed (https://github.com/k3s-io/klipper-helm/blob/master/entry#L39), so if the chart is not actually installed and has timed out, it will attempt to re-install it again.
It might be a good idea to add extra args for the helm command to the chart spec. cc @brandond
If we do that we'd probably need to be pretty flexible about it, and allow for specific install/upgrade/delete args?
The fastest way to solve this would probably be to build your own klipper-helm image that specifies the timeout you want, and set that as jobImage in the spec.
@galal-hussein This only works if the chart is in status deployed. Otherwise it returns cannot re-use a name that is still in use.
- Checkout https://github.com/k3s-io/klipper-helm
- Modify https://github.com/k3s-io/klipper-helm/blob/master/entry#L31 by adding --timeout=x at the end
- Run make
- Tag and push the resulting image to a registry
- Update jobImage in your spec to point at your custom image
Even if we do modify helm-controller, it likely would not make it into k3s for several weeks until we do our next patch release.
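For reference, a minimal sketch of what pointing a HelmChart at such a custom image could look like; the chart and repo come from the pomerium example used later in this thread, and the image name is just a placeholder:

```sh
# Sketch only: example names, adjust to your setup.
kubectl apply -f - <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pomerium
  namespace: kube-system
spec:
  chart: pomerium
  repo: https://helm.pomerium.io
  targetNamespace: kube-system
  # Custom klipper-helm build with a longer --timeout baked into the entry script.
  jobImage: registry.example.com/my-org/klipper-helm:longer-timeout
EOF
```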
Thanks @brandond. This will be helpful for me.
Whenever a release gets into an indeterminate state with the error cannot re-use a name that is still in use, one can clear it by either deleting the release secret or patching it in this format:
kubectl patch secret/sh.helm.release.v1.traefik.v2 --type=merge -p '{"metadata":{"labels":{"status":"deployed"}}}'
then installation succeeds.
Alternatively, the operation helm upgrade --install .. is idempotent and takes care of these edge cases. My request is that you consider using it if you determine there are no downsides.
Also, I would patiently wait for that patch release :-)
@brandond, just to zero in on the actual issue:
klipper-helm does not handle the pending-* statuses in the helm_update() function. The status pending-upgrade can result if a timeout occurs. Subsequent installs will then fail without manual intervention or helm upgrade --install.
A PR would be appreciated!
Update: this may be the root cause of some issues we're seeing with nginx upgrade failures in RKE2, so I'll be taking a look at it tomorrow.
> A PR would be appreciated!
@brandond sure. I'll let you complete your findings first. For reference https://github.com/helm/helm/pull/7653
https://github.com/helm/helm/issues/5595#issuecomment-684312311
https://github.com/helm/helm/issues/5513
kubectl -n $namespace delete secret -lstatus=pending-upgrade
These are the available statuses
https://github.com/helm/helm/blob/bf486a25cdc12017c7dac74d1582a8a16acd37ea/cmd/helm/status.go#L41
- state of the release (can be: unknown, deployed, uninstalled, superseded, failed, uninstalling, pending-install, pending-upgrade or pending-rollback)
@brandond @galal-hussein Here is a complete example.
- Query the helmchart with the error:
$ kubectl get po -n kube-system
NAME                          READY   STATUS             RESTARTS   AGE
coredns-7448499f4d-ldbnb      1/1     Running            0          46h
helm-install-pomerium-khnr5   0/1     CrashLoopBackOff   29         30m
- Check logs
$ kubectl logs -f -n kube-system helm-install-pomerium-khnr5
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' true '!=' true ']'
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/pomerium.tgz.base64
+ CHART_PATH=/pomerium.tgz
+ '[' '!' -f /chart/pomerium.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
+ '[' helm_v3 == helm_v3 ']'
+ [[ pomerium == stable/* ]]
+ '[' -n https://helm.pomerium.io ']'
+ helm_v3 repo add pomerium https://helm.pomerium.io
"pomerium" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "pomerium" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace kube-system --repo https://helm.pomerium.io
+ '[' helm_v3 == helm_v3 ']'
++ helm_v3 ls -f '^pomerium$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=null,null
++ cut -f1 -d,
++ echo null,null
+ INSTALLED_VERSION=null
++ cut -f2 -d,
++ echo null,null
+ STATUS=null
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ '[' install = delete ']'
+ '[' -z null ']'
+ '[' null = deployed ']'
+ '[' null = failed ']'
+ '[' null = deleted ']'
+ helm_v3 install --namespace kube-system --repo https://helm.pomerium.io pomerium pomerium --values /config/values-01_HelmChart.yaml
Error: cannot re-use a name that is still in use
- Listing does not find the chart:
$ helm ls -f '^pomerium$' --namespace kube-system
NAME    NAMESPACE    REVISION    UPDATED    STATUS    CHART    APP VERSION
- Listing with the -a flag does find the chart, with status pending-upgrade:
$ helm ls -f '^pomerium$' --namespace kube-system -a
NAME      NAMESPACE    REVISION  UPDATED                                  STATUS           CHART            APP VERSION
pomerium  kube-system  4         2021-06-06 18:28:56.738300836 +0000 UTC  pending-upgrade  pomerium-20.0.0  0.14.4
In the end, the issue is that your helm ls -f filter for helm3 will not find all the charts, so it falls back to installing over an existing chart, which will obviously fail.
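Roughly speaking, something along these lines would be needed before deciding that a plain install is safe; this is just a sketch to illustrate the idea using the release and namespace from the example above, not the actual patch:

```sh
# Illustration only: look the release up including pending states before choosing
# between install, upgrade, and cleanup.
STATUS=$(helm ls -a -f '^pomerium$' --namespace kube-system --output json \
  | jq -r '.[0].status // empty' | tr '[:upper:]' '[:lower:]')
case "$STATUS" in
  pending-install|pending-upgrade|pending-rollback)
    echo "release is stuck in '$STATUS'; a plain install will fail with 'cannot re-use a name'"
    ;;
  "")
    echo "no release found; a fresh install is safe"
    ;;
esac
```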
Yes, the fix will handle that properly, as well as allow for extending the timeout.
@brandond Great!!
This is landing in RKE2 1.21; no ETA on when we will pull it in to K3s or older RKE2 releases.
Thanks @brandond!! Any idea how we can make use of it in older k3s releases considering its importance?
As before, you can specify jobImage in your HelmChart to at least take advantage of the fixes for chart status handling. Using the timeout field on the spec will require an update to the helm controller built in to k3s or rke2, so you'll need to wait for a release for that.
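Once a build with the updated controller is available, usage should look roughly like this; a sketch that assumes the new spec field is named timeout and accepts a duration string, so check the released CRD for the exact name and format:

```sh
# Sketch only: assumes spec.timeout exists and takes a Go-style duration string.
kubectl apply -f - <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pomerium
  namespace: kube-system
spec:
  chart: pomerium
  repo: https://helm.pomerium.io
  targetNamespace: kube-system
  timeout: 20m0s
EOF
```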
Got it. Thanks again. A question: what is the default command you run when you encounter a $STATUS that is not explicitly handled?
@brandond the error still happens with pending-install.
You should upgrade in this case.
@jawabuu so pending-install can't be resolved by installing? That's unexpected. I was unable to get a chart stuck in this state and couldn't test. Is upgrading in this case always the correct behavior?
I reached out to @mattfarina about this to see what he suggests.
@brandond
- You can get a chart stuck in that state by specifying an incorrect imagePullSecrets
- I believe the correct fallback behaviour, if not specifically handled, should be helm upgrade --install
The "pending" states are supposed to indicate that there is another Helm process altering the chart; but in this case the process has been interrupted so the lock is stale.
- helm install will fail because there's already a chart with this name present (the current problem we're running into).
- helm upgrade --install will not perform an install because there's already a release with this name.
- helm upgrade --install will not perform an upgrade, because the previous release for this name is in a pending state.
It looks to me like the only thing to do in case of any pending state is to uninstall and reinstall.
It was suggested that we could just patch the secret for the most recent release to mark it failed, although this doesn't seem to actually work since the status is also stored inside the secret's release data in the info.status field.
@mattfarina suggested writing a small Go utility to grab the release secret and update the status to "failed", but this would require some new development work.
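For anyone who wants to verify that, the status stored inside the release payload can be read back like this; a sketch using the secret name from the pomerium example above, relying on Helm v3 storing the release as base64(gzip(json)) inside the Secret, which the API base64-encodes once more:

```sh
# Read the status recorded inside the Helm v3 release payload itself.
# The secret name/revision are from the example above; adjust to your release.
kubectl -n kube-system get secret sh.helm.release.v1.pomerium.v4 \
  -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip | jq -r '.info.status'
```

Patching only the metadata label leaves this inner field untouched, which is presumably why the label patch alone is not enough.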
@brandond You are correct.
I also think increasing the timeout will help some of these situations (once it hits k3s)
What are your thoughts on having a more graceful operation when such pending statuses are encountered?
Maybe helm upgrade --install and a log output with suggestions instead of a CrashLoopBackOff situation.
@brandond An interesting workaround I've discovered:
- All installs currently time out at 5m (the helm default).
- At this point klipper-helm retries and one of two things happens:
  a) The release is in pending-install, which is not handled, and klipper-helm errors out.
  b) The release is marked failed and klipper-helm proceeds to uninstall and reinstall; this is undesirable because all the pods might be up and only a migration job is still running. Deleting will keep this release in an infinite loop.
What I have done is delete the helm secret for the release created by klipper-helm before the 5 minutes elapse.
Example:
kubectl delete secret/sh.helm.release.v1.superset.v1 -n superset
I do this every time the secret is created, and once the post-install job completes, klipper-helm finishes successfully.
I have tested this with the Superset and Sentry charts, both of which have migrations that take well over 5 minutes on a 1 CPU / 2 GB instance.
I believe an approach for pending-install would be to delete the helm secret.
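In case it helps anyone else hitting this, a rough automation of that workaround could look like the following; the post-install job name is hypothetical, while the release and namespace come from the Superset example above:

```sh
# Keep deleting the pending release record until the long-running post-install job
# finishes, so the 5-minute timeout never leaves a stuck release behind.
# Job name "superset-init-db" is hypothetical; substitute your chart's migration job.
until kubectl -n superset wait --for=condition=complete job/superset-init-db --timeout=60s 2>/dev/null; do
  kubectl -n superset delete secret sh.helm.release.v1.superset.v1 --ignore-not-found
done
```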
We're planning to write a tool to move the pending releases into failed state, and then trigger an upgrade or reinstall of the chart. RKE2 currently runs into this most frequently when upgrades are interrupted, due to issues with how we schedule the job pods. It's not that the install is timing out, it's that we're killing the apiserver to upgrade it, and helm can't finish the task nor can it talk to the apiserver to mark the release as failed.
See https://github.com/rancher/rke2/issues/1124 for a discussion of the larger issue.