
Wave occasionally causes secrets to be deleted unexpectedly

Open timothyb89 opened this issue 5 years ago • 8 comments

There seems to be a race condition where Wave will occasionally cause secrets to be unexpectedly deleted along with a parent deployment when only the deployment is removed via kubectl apply --prune ....

I've compiled a minimal reproduction case here, with some extra detail: https://github.com/timothyb89/wave-bug-test

I couldn't reproduce the bug with kubectl delete deployment ..., and after checking the exact API calls (via kubectl -v 10) the only difference in the actual DELETE call between kubectl delete vs kubectl apply --prune is that apply sets propagationPolicy=foreground while kubectl delete uses the default, which is evidently background.
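For reference, a rough reconstruction of the two DELETE request bodies (the `propagationPolicy` field of `DeleteOptions` is real; the exact payloads here are an assumption based on the `-v 10` output described above):

```python
# Hypothetical sketch of the two DELETE request bodies; the only
# difference observed was the propagationPolicy field.
delete_default = {
    "kind": "DeleteOptions",
    "apiVersion": "v1",
    # no propagationPolicy: the API server falls back to the default,
    # which is evidently background garbage collection
}
prune_delete = {
    "kind": "DeleteOptions",
    "apiVersion": "v1",
    "propagationPolicy": "Foreground",  # set by kubectl apply --prune
}
```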

Given this, I think the issue is related to Wave's finalizer and Kubernetes' deletion propagation policies. Per the docs, with background propagation, Kubernetes should wait for the parent resource to finish deleting before removing children, so it makes sense that kubectl delete deployment ... would never delete a secret given Wave's finalizer should have already run to completion.

On the other hand, with foreground propagation, I don't think Kubernetes guarantees that parent finalizers (like Wave's) will finish before it starts removing child resources (it may even explicitly do the inverse, removing children before running the parent's finalizers). It's surprising to me that secrets aren't always deleted in this case, but I guess Wave's finalizer can sometimes remove the ownerReference just in time to prevent the secret from being removed.
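The suspected race can be modelled as a toy simulation (hypothetical names, not Wave's real code): during foreground deletion, whether the Secret survives depends on whether Wave's finalizer removes the ownerReference before the GC inspects the dependent.

```python
def foreground_delete(secret, finalizer_won_race):
    """Simulate foreground deletion of the owning Deployment.

    Returns True if the Secret survives. In this model the GC deletes
    dependents roughly concurrently with the owner's finalizers, so
    survival depends on which side wins the race.
    """
    if finalizer_won_race:
        # Wave's finalizer removed the ownerReference just in time;
        # the GC no longer sees the Secret as a dependent.
        secret["ownerReferences"] = []
    # Any remaining ownerReference means the GC removes the Secret
    # along with the parent Deployment.
    return len(secret["ownerReferences"]) == 0
```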

timothyb89 avatar Apr 13 '20 23:04 timothyb89

If it's the case that foreground deletion deletes all of the children itself rather than letting the GC do it, then I don't think there's much that can be done about this apart from documenting that foreground deletion shouldn't be used.

Wave normally works because background deletion relies on the GC to look for resources with owner references that now point into the ether: if the owner object is gone, the GC removes the owner reference, or, if it was the last owner, deletes the resource. With the finalizer logic in Wave, this should never happen.
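The background path described above can be sketched as a toy GC sweep (illustrative only; heavily simplified from the real garbage collector):

```python
def gc_sweep(objects, live_owner_uids):
    """Toy model of the GC pass that background deletion relies on:
    owner references pointing at deleted owners are pruned, and an
    object whose last owner is gone is itself deleted."""
    survivors = []
    for obj in objects:
        had_owners = bool(obj["ownerReferences"])
        # Prune references whose owner no longer exists.
        obj["ownerReferences"] = [
            uid for uid in obj["ownerReferences"] if uid in live_owner_uids
        ]
        # An object that just lost its last owner is garbage-collected;
        # everything else (including objects that never had owners) stays.
        if had_owners and not obj["ownerReferences"]:
            continue
        survivors.append(obj)
    return survivors
```

With Wave's finalizer, the ownerReference is removed before the Deployment disappears, so in this model the Secret never reaches the "lost its last owner" branch.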

JoelSpeed avatar Apr 14 '20 16:04 JoelSpeed

That's what I suspected, though unfortunately kubectl apply does not allow users to set the propagation policy, so it's not really avoidable for workflows that need prune functionality.

I can think of a couple of possible workarounds:

  • additionally attach finalizers to secrets to remove the owner reference in background deletion
  • track deployment/secret relationships in some other field that does not cause cascading deletion, e.g. an annotation

... but both feel like relatively huge changes to fix a relatively niche issue. We're still debating how to work around this on our end, but figured it'd be good to at least report it in case anyone else stumbles on this.

timothyb89 avatar Apr 14 '20 21:04 timothyb89

I have a feeling any fix here might be worse than the disease, if I'm honest. It does make sense to add some documentation on this, though.

I am kind of curious what your use case is. Most places where I've seen --prune used would want to include any ConfigMap or Secret used by a Deployment in the applied yaml anyway, so this shouldn't be a problem?

wonderhoss avatar Apr 14 '20 21:04 wonderhoss

track deployment/secret relationships in some other field that does not cause cascading deletion, e.g. an annotation

Were I to redesign this now, I think this is probably the way I would have gone about it: add an annotation that has a list of owners in it, which can then be mapped to the right Deployments.

I think it would be possible to do this, but it would mean having migration code in there that would eventually be removed:

  • On reconcile, if there is a finalizer, perform the migration code to remove the owner references and set the correct annotations, then remove the finalizer
  • Then set up the watches to use a map function instead of the owner references
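The two migration steps above could look roughly like this (a sketch with hypothetical annotation and finalizer names, operating on plain dicts rather than real API objects):

```python
OWNER_ANNOTATION = "wave.example.com/owners"  # hypothetical annotation key

def migrate(deployment, secrets, finalizer="wave.example.com/finalizer"):
    """Sketch of the proposed one-shot migration: move ownership
    tracking from cascading ownerReferences to an annotation, then
    drop Wave's finalizer, since cascading deletion can no longer
    reach the Secrets."""
    if finalizer not in deployment["finalizers"]:
        return  # already migrated, nothing to do
    for secret in secrets:
        # Drop the ownerReference that makes deletion cascade...
        secret["ownerReferences"] = [
            ref for ref in secret["ownerReferences"]
            if ref != deployment["name"]
        ]
        # ...and record the relationship in a harmless annotation
        # that a watch map function could later resolve back to the
        # owning Deployment.
        annotations = secret.setdefault("annotations", {})
        annotations[OWNER_ANNOTATION] = deployment["name"]
    deployment["finalizers"].remove(finalizer)
```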

I actually think this would resolve a number of issues that I've seen with the project in the past.

JoelSpeed avatar Apr 14 '20 21:04 JoelSpeed

Most places where I've seen --prune used would want to include any ConfigMap or Secret used by a Deployment in the applied yaml anyway, so this shouldn't be a problem?

I think that makes sense for ConfigMaps, but we manage secrets separately to avoid checking them into Git, and normally expect that --prune should never touch these resources.

In our case, we removed a deployment, encountered an unrelated issue, reverted the change, and noticed an externally-generated database secret had disappeared.

Other less specific situations might include:

  • Renaming a deployment
  • Splitting off some functionality into a microservice
  • Changing a protected field with --prune --force, triggering a delete and recreate

timothyb89 avatar Apr 14 '20 22:04 timothyb89

Oh wow, interesting issue... Don't really have much to add to the discussion, and have never used kubectl apply --prune (didn't even know --prune was a thing until now).

track deployment/secret relationships in some other field that does not cause cascading deletion, e.g. an annotation

Were I to redesign this now, I think this is probably the way I would have gone about it: add an annotation that has a list of owners in it, which can then be mapped to the right Deployments.

Although the other way around (I think, haven't had enough coffee yet), we toyed around with the idea of having an annotation on the Deployment/DaemonSet/StatefulSet to track the latest hash of a ConfigMap/Secret to be able to provide better feedback on what triggered a rollout (as an alternative approach to #74). If we were to go the annotation route with tracking ownership then we'd probably add this on as well.
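The hash-annotation idea can be sketched in a few lines, assuming a stable digest over the ConfigMap/Secret data (names here are illustrative, not anything Wave actually implements):

```python
import hashlib
import json

def config_hash(data):
    """Compute a stable digest of a ConfigMap/Secret's data, suitable
    for storing as an annotation on the workload so it's visible what
    triggered a rollout. Keys are sorted so the hash is independent of
    dict ordering."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```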

Don't think we have any plans to start implementing something like this right now, but PRs welcome? :)

mthssdrbrg avatar Apr 16 '20 10:04 mthssdrbrg

This issue has been inactive for 60 days. If the issue is still relevant please comment to re-activate the issue. If no action is taken within 7 days, the issue will be marked closed.

github-actions[bot] avatar Nov 23 '20 00:11 github-actions[bot]

Fixed by #154

jabdoa2 avatar Apr 30 '24 12:04 jabdoa2

@toelke this one can be closed. I guess I should have commented in the PR instead of here.

jabdoa2 avatar May 03 '24 12:05 jabdoa2