cloud_storage: bucket scrub
By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update the manifest. We do our best to avoid it (https://github.com/redpanda-data/redpanda/pull/8560) but it will happen from time to time.
Like any storage system, Redpanda needs a data scrubbing feature to ensure good data hygiene over long storage periods. The scrub can be more or less extensive depending on the needs of a given system:
- The most lightweight scrub consists of reconciling an object listing with the contents of the topic table and of manifests:
  - all segments should either exist in a manifest, or correspond to a manifest spill range (in infinite storage) for a known partition.
  - all segments referenced by a manifest should exist in the object store.
- The most heavyweight scrub requires reading every byte of every object and validating the CRCs on every batch.
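The lightweight reconciliation could be sketched roughly as below. This is only an illustration of the two checks, not the scrubber's actual code: the `reconcile` function, the `(partition, base_offset, key)` listing tuples, and the inclusive `(start, end)` spill ranges are all hypothetical shapes chosen for the example, and "known partition" is approximated as "has a manifest".

```python
def reconcile(listed, manifests, spill_ranges):
    """Compare an object listing against manifests (hypothetical data shapes).

    listed:       iterable of (partition, base_offset, key) for listed segments
    manifests:    {partition: set of segment keys referenced by its manifest}
    spill_ranges: {partition: [(start_offset, end_offset), ...]}, inclusive
    Returns (orphans, missing) as two sets of keys.
    """
    listed_keys = set()
    orphans = set()
    for partition, base_offset, key in listed:
        listed_keys.add(key)
        if partition not in manifests:
            # Segment belongs to a partition we do not know about.
            orphans.add(key)
            continue
        if key in manifests[partition]:
            continue  # referenced directly by the manifest
        ranges = spill_ranges.get(partition, [])
        if any(start <= base_offset <= end for start, end in ranges):
            continue  # covered by a spilled manifest range
        orphans.add(key)

    # Segments a manifest references but the listing did not return.
    referenced = {k for keys in manifests.values() for k in keys}
    missing = referenced - listed_keys
    return orphans, missing


listing = [
    ("t/0", 0, "a"),    # in the manifest
    ("t/0", 100, "b"),  # covered by a spill range
    ("t/0", 500, "c"),  # neither: orphan
    ("x/0", 0, "d"),    # unknown partition: orphan
]
orphans, missing = reconcile(
    listing,
    manifests={"t/0": {"a", "e"}},
    spill_ranges={"t/0": [(100, 199)]},
)
print(sorted(orphans))  # ['c', 'd']
print(sorted(missing))  # ['e']
```

Reporting orphans and missing segments separately keeps the destructive action (deleting orphans) distinct from the alerting one (a manifest pointing at a nonexistent object).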
The heavyweight scrub is probably only useful on less-trusted object stores (e.g. if someone uses MinIO with its basic filesystem backend); there is less value in scrubbing a more highly trusted backend like AWS S3.
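To make the cost of the heavyweight variant concrete, here is a minimal sketch of a full-object CRC scan. It assumes CRC-32C (the checksum used by the Kafka record batch format) and a deliberately simplified, hypothetical framing of `[4-byte length][4-byte CRC][payload]`; this is not Redpanda's real on-disk batch layout, and `scan_object` and `frame` are names invented for the example.

```python
import struct

# Table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Hypothetical frame header: big-endian payload length, then CRC-32C.
HEADER = struct.Struct(">II")


def frame(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), crc32c(payload)) + payload


def scan_object(data: bytes) -> list:
    """Return byte offsets of frames that are truncated or fail their CRC."""
    bad = []
    pos = 0
    while pos < len(data):
        if pos + HEADER.size > len(data):
            bad.append(pos)  # truncated header
            break
        length, expected = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) != length or crc32c(payload) != expected:
            bad.append(pos)
        pos = start + length
    return bad


obj = bytearray(frame(b"hello") + frame(b"world"))
obj[-1] ^= 0xFF  # flip bits in the last byte of the second payload
print(scan_object(bytes(obj)))  # [13]
```

Every byte of every object has to be downloaded and checksummed, which is why this mode only pays off where the backing store itself is not trusted to detect corruption.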
JIRA Link: CORE-1177
There's a functional draft of updating the scrubber to clean up orphan segments here: https://github.com/redpanda-data/redpanda/tree/orphan-cleanup
We should ensure this cleanup can be disabled, for customers who prefer to keep their buckets immutable.