BUG: Re-deploy does not happen
Problem Description
Deploy a (Cluster)Profile with a Helm chart, then delete it. Before the cleanup completes, delete the cluster.
The Helm client appears to get stuck in this situation.
If the cluster and the (Cluster)Profile are then recreated, the Helm chart never gets deployed. The same error keeps recurring:
**E0627 14:21:32.913960 1 clustersummary_controller.go:368] "failed to deploy" err="cleanup of Helm still in progress. Wait before redeploying" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="default/deploy-calico-capi-clusterapi-workload" namespace="default" name="deploy-calico-capi-clusterapi-workload" reconcileID="ee3012a7-9a64-4f08-8b19-31e797f61946" **
Workaround: delete the addon-controller pod.
@gianlucam76 we seem to be hitting this too, although in a slightly different scenario. Essentially what we did was:
- Create a cluster and wait for addons to come up
- Delete the cluster and wait for cluster and clustersummary to be deleted
- Create the cluster again with the same name

In this case we again see the same error as above. I had a quick look at the code around this error and suspect that the deployer object, which is cached internally, doesn't get deleted on cluster delete. Let me know if my understanding below is incorrect.
Sveltos seems to maintain a deployer client that keeps track of the helm/policy/kustomize deployments and their status.
type deployer struct {
	log logr.Logger
	client.Client
	..
}

func (d *deployer) IsInProgress(
	clusterNamespace, clusterName, applicant, featureID string,
	clusterType sveltosv1beta1.ClusterType,
	cleanup bool,
) bool {
	key := GetKey(clusterNamespace, clusterName, applicant, featureID, clusterType, cleanup)
	d.mu.Lock()
	defer d.mu.Unlock()
	for i := range d.inProgress {
		if d.inProgress[i] == key {
			d.log.V(logs.LogVerbose).Info("request is already in inProgress")
			return true
		}
	}
	return false
}
The deployer object is initialised when the addon controller registers a ClusterSummary object, and it is queried to fetch the state of the addon. The object itself doesn't have a CR, so it's stored in memory in the addon controller. But the registration seems to happen with just the cluster and namespace.
I do see cleanup code for the deployer object, so I will have to trace the cleanup, but I don't see anything odd in the logs. My suspicion is that this deployer object isn't cleaned up along with the previous deployment and is giving the addon controller stale data. That would explain why a restart fixes this: the in-memory state vanishes.
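
To make the suspicion concrete, here is a minimal sketch of how a stale in-memory entry could block a redeploy, and how dropping all entries for a cluster on delete would clear it. This is not the Sveltos implementation: `tracker`, `getKey`, `markInProgress`, `isInProgress` and `forgetCluster` are hypothetical names, and the `:::`-separated key layout is only inferred from the `key=` field in the worker logs further down this thread.

```go
package main

import (
	"strconv"
	"strings"
	"sync"
)

// sep is the separator seen in the logged keys (an assumption about the real format).
const sep = ":::"

// tracker is a hypothetical stand-in for the deployer's in-memory state.
type tracker struct {
	mu         sync.Mutex
	inProgress []string
}

// getKey builds a key shaped like the ones in the worker logs:
// namespace:::cluster:::type:::applicant:::feature:::cleanup
func getKey(clusterNamespace, clusterName, clusterType, applicant, featureID string, cleanup bool) string {
	return strings.Join([]string{
		clusterNamespace, clusterName, clusterType, applicant, featureID,
		strconv.FormatBool(cleanup),
	}, sep)
}

// markInProgress records that a request with this key has started.
func (t *tracker) markInProgress(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inProgress = append(t.inProgress, key)
}

// isInProgress reports whether a request with this key is still recorded.
func (t *tracker) isInProgress(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, k := range t.inProgress {
		if k == key {
			return true
		}
	}
	return false
}

// forgetCluster drops every entry that belongs to the given cluster, so a
// recreated cluster with the same name does not inherit stale state.
func (t *tracker) forgetCluster(clusterNamespace, clusterName, clusterType string) {
	prefix := strings.Join([]string{clusterNamespace, clusterName, clusterType}, sep) + sep
	t.mu.Lock()
	defer t.mu.Unlock()
	kept := t.inProgress[:0]
	for _, k := range t.inProgress {
		if !strings.HasPrefix(k, prefix) {
			kept = append(kept, k)
		}
	}
	t.inProgress = kept
}
```

In this sketch, a cleanup entry that is never removed keeps `isInProgress` returning true for the recreated cluster, which matches the "cleanup of Helm still in progress" symptom; calling something like `forgetCluster` from the cluster-delete path (or restarting the pod) clears it.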
Thank you for the detailed explanation @psarwate. I think you are correct: the issue seems to be in the caching code.
Let me try to repro this. Will update here. Thank you
This is with v0.38.3 right?
Hi @psarwate I can repro with those steps:
- deploy kyverno with ClusterProfile
- delete cluster
- recreate cluster
Now kyverno does not get redeployed, failing with err="cleanup of Helm still in progress. Wait before redeploying"
It is the same as https://github.com/projectsveltos/addon-controller/issues/609
Will push a fix for it shortly
The problem is that the Helm SDK gets stuck when undeploying. The cluster is present but being deleted, so I call undeploy. The Helm SDK then starts to undeploy, but in the process the cluster goes away and is no longer reachable. The Helm SDK never returns.
I0926 14:17:56.383873 1 worker.go:172] "worker: 0 processing request. cleanup: true" logger="deployer" worker="0" key="default:::clusterapi-workload:::Capi:::deploy-kyverno-capi-clusterapi-workload:::Helm:::true"
I0926 14:17:56.383921 1 worker.go:174] "invoking handler" logger="deployer" worker="0" key="default:::clusterapi-workload:::Capi:::deploy-kyverno-capi-clusterapi-workload:::Helm:::true"
I0926 14:17:56.620020 1 handlers_helm.go:999] "uninstalling release" logger="deployer" worker="0" key="default:::clusterapi-workload:::Capi:::deploy-kyverno-capi-clusterapi-workload:::Helm:::true" cluster="default/clusterapi-workload" clusterSummary="deploy-kyverno-capi-clusterapi-workload" admin="/" release="kyverno-latest" releaseNamespace="kyverno"
The log line from logger.V(logs.LogDebug).Info("uninstalling release done") never appears, which confirms the uninstall call does not return.
@psarwate I do have a fix. I was able to repro and add the fix and verify it.
Can you please file a different bug?
Thanks for looking at this @gianlucam76. I will file a new bug.
Thanks @psarwate
I also generated this image for you: projectsveltos/addon-controller:v0.38.3-test
If you are using main or v0.38.3, just edit the addon-controller deployment in the management cluster to use this image, so you can also verify the fix yourself. Thank you
And if that's OK, please consider whether being added to the adopter list is an option. Thanks
Thanks for the quick turnaround @gianlucam76. Will test the image. Will get back to you on the adopter list.
Opened https://github.com/projectsveltos/addon-controller/issues/711. Thanks!