serverless-registry Garbage collector

This PR introduces a garbage collector. When we remove an image or tag from the repository, blobs referenced by deleted manifests are not removed from R2. With this new PR, we'll schedule garbage collecting after any modifying operations. GC will wait 10 minutes after any modifications are done to the repository this ensures we'll not start garbage collecting of freshly uploaded blobs without parenting manifest.

We have 2 modes for the garbage collector, unreferenced and untagged:

Unreferenced will delete all blobs that are not referenced by any manifest.
Untagged will delete all blobs that are not referenced by any manifest and are not tagged.

Users can skip the GARBAGE_COLLECTOR_MODE variable which will disable GC.

Some considerations:

PR utilizes new RPC, so we have updated compatibility_date
At this moment Set is used for keeping track of blobs, we can run out of memory and should use the bloom filter instead, which means we'll have a new dependency.
GC doesn't remove dummy manifests that reference non-existent blobs.

Apr 14 '24 18:04 IvanDev

Thanks for the contribution! I agree we should have some kind of garbage collection to remove dangling layers. However, I might have some concerns with having garbage collection done by a DO alarm or asynchronously.

AFAIU, there is a small chance that there is a new manifest uploaded while the garbage collection is happening and new layers might be removed. I think this is just something we will have to accept for this in some way to keep the registry implementation simple. If we go with this route, users of the registry that use garbage collection really need to know that it comes with some risks.

I am going to propose some things and let me know what you think (just throwing some stuff here for discussion):

I have mixed feelings about these passes being scheduled after a tag removal. Another simpler alternative is exposing a garbage collection endpoint and let the admin just trigger garbage collection when they deem safe to do so.
If we use DO: We could take a lock from the GC DO when we try to upload a manifest, but makes the implementation have more complexity.
If we use DO: when we remove a manifest couldn't we send over to the DO the layers we should look for so GC is a bit cheaper to do? No need to list all blobs through R2. If we are using DO we could also use some state.

IMO I'd go for the GC endpoint and just let anybody that needs GC to explicitly call it.

If we are feeling fancy with DOs we could do these passes in there, but still with workers unbounded plans + ctx.waitUntil() we can extend computation time already by a bit.

Hope you find this overall feedback fair! I am open to have GC just wanting to see your point of view on these 😄.

Apr 14 '24 22:04 gabivlj

Here are my 2 bits:

I totally agree with having a manual GC request being the safest option. In the recent changes, I added /:name+/collectgarbage route with optional mode=unreferenced|untagged parameter so users can trigger GC whenever they feel safe.
I also added a date check for the uploaded date field on R2 objects, so now when we scan through the blobs and if there are any changes within the last 10 minutes we'll abort GC scan to prevent damage to freshly uploaded layers. This should make the damage by GC even more unlikely. In addition, GC is now scheduled after 30 minutes this should cover for any network delays.
Introducing a lock will also introduce complexity and the potential for a deadlock. So unless we really want it, I guess we can skip it and use the current implementation since it should cover most of the cases.
Unfortunately, we can't minimize the request count to manifests since layers potentially can be referenced from multiple manifests (e.g. with tag/digest manifests). Counting the strong consistency of R2, relatively rare writes to the registry, and cost savings on storage by using even current primitive GC I think it should be good enough for now. As you mentioned, having DO state would make scanning almost x4.5 cheaper, but ROI on investing that much time into all of this complexity probably would be negligible, at least on most use cases with relatively rare writes (and users can even schedule GC manually now).

I agree with you on your concerns and I'm too a bit afraid of over-engineering :) I would prefer to keep the implementation simple and introduce complexity only when we really need it.

Apr 15 '24 12:04 IvanDev

I like the date change. I'd suggest we remove the asynchronous garbage collection if we have the endpoint. And in that case I do not think DO is necessary (trying to stay within R2 as much as possible here).

Apr 15 '24 14:04 gabivlj

Sure, let's keep it safe. Had to remove date checks since it's useless now. Let's keep users responsible for their own actions :)

Apr 15 '24 14:04 IvanDev

I think you will have to run the formatter on this PR and the other https://github.com/cloudflare/serverless-registry/pull/30/files

Apr 19 '24 18:04 gabivlj

Sure!

Apr 20 '24 14:04 IvanDev

upvote on this

May 27 '24 09:05 martinwang2002

Hello folks! Due to garbage collection popular demand I have cherry picked @IvanDev's commit for rebase and add a bit of concurrency safety that I thought could be nice. Thank you for the feedback. https://github.com/cloudflare/serverless-registry/pull/48

Sep 23 '24 04:09 gabivlj

Cherry picked commit from this PR has been merged here https://github.com/cloudflare/serverless-registry/pull/48. Thank you @IvanDev for the implementation and the discussion!

Sep 23 '24 19:09 gabivlj