redpanda cloud_storage: bucket scrub

By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update the manifest. We do our best to avoid it (https://github.com/redpanda-data/redpanda/pull/8560) but it will happen from time to time.

Like any storage system, to ensure good data hygiene over long storage periods, Redpanda needs a data scrubbing feature. This can be more or less extensive depending on the needs of a given system:

The most lightweight scrub consists of reconciling an object listing with the contents of the topic table and of the manifests:
- all segments should either be referenced by a manifest, or fall within a manifest spill range (in infinite storage) for a known partition.
- all segments referenced by a manifest should exist in the object store.

The most heavyweight scrub requires reading every byte of every object and validating the CRCs on every batch.

The heavyweight scrub is probably only useful on less-trusted object stores (e.g. if someone uses MinIO with its basic filesystem backend). There is less value in scrubbing a more highly trusted backend like AWS S3.
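
To make the lightweight reconciliation concrete, here is a rough sketch of the two checks as set differences. It is illustrative only and not Redpanda's scrubber: `reconcile`, `ScrubReport` and the `covered_by_spill` predicate are hypothetical names, and it assumes the bucket listing has already been filtered down to segment keys for known partitions.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ScrubReport:
    # Segments present in the bucket that nothing accounts for.
    orphans: set[str] = field(default_factory=set)
    # Segments a manifest references that the bucket does not contain.
    missing: set[str] = field(default_factory=set)


def reconcile(
    listed_keys: Iterable[str],
    manifest_keys: Iterable[str],
    covered_by_spill: Callable[[str], bool] = lambda key: False,
) -> ScrubReport:
    """Compare an object listing against manifest contents.

    covered_by_spill is a hypothetical hook: it should return True when a
    key falls inside a manifest spill range for a known partition.
    """
    listed = set(listed_keys)
    referenced = set(manifest_keys)

    # Check 1: every listed segment is referenced or covered by a spill range.
    orphans = {k for k in listed - referenced if not covered_by_spill(k)}
    # Check 2: every referenced segment exists in the object store.
    missing = referenced - listed
    return ScrubReport(orphans=orphans, missing=missing)
```

A scrubber built this way would only report by default; deleting the reported orphans would be a separate, optional step, so that it can stay switched off for buckets that must not be modified.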

JIRA Link: CORE-1177

Feb 23 '23 16:02 jcsp

There's a functional draft of updating the scrubber to clean up orphan segments here: https://github.com/redpanda-data/redpanda/tree/orphan-cleanup

Jul 17 '23 14:07 jcsp

We should ensure this can be disabled, for customers that prefer to have their buckets immutable.

May 21 '24 09:05 pmw-rp