
cloud_storage/inventory: Add inventory create-config API for AWS

Open abhijat opened this issue 2 years ago • 1 comments

This PR introduces the inventory API for AWS. The primary object added is cloud_storage::inventory::inv_ops, which is not yet wired into any service. Eventually the scrubber-inventory download service (yet to be created) will use it to schedule and download reports.

This PR contains the create_inventory_configuration method. It schedules reports at the given frequency and in the given format by creating a report configuration, which it does by calling PutBucketInventoryConfiguration.
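For illustration, the request body that PutBucketInventoryConfiguration expects can be sketched as below. This is a minimal sketch, not the code in this PR: the function name make_inventory_config_xml and the two enums are hypothetical, the values are not XML-escaped, and optional elements such as Prefix, Filter and OptionalFields are left out. The element names follow the AWS S3 REST API.

```cpp
#include <string>

// Hypothetical enums for illustration; the PR's actual types may differ.
enum class report_format { csv, orc, parquet };
enum class report_frequency { daily, weekly };

inline std::string to_xml_value(report_format f) {
    switch (f) {
    case report_format::csv: return "CSV";
    case report_format::orc: return "ORC";
    case report_format::parquet: return "Parquet";
    }
    return "CSV";
}

inline std::string to_xml_value(report_frequency f) {
    switch (f) {
    case report_frequency::daily: return "Daily";
    case report_frequency::weekly: return "Weekly";
    }
    return "Daily";
}

// Builds the XML payload for a PUT to /?inventory&id=<id> on the source bucket.
// Assumes the inputs contain no characters that need XML escaping.
inline std::string make_inventory_config_xml(
  const std::string& id,
  const std::string& destination_bucket_arn,
  report_format format,
  report_frequency frequency) {
    std::string xml;
    xml += "<InventoryConfiguration xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">";
    xml += "<Destination><S3BucketDestination>";
    xml += "<Bucket>" + destination_bucket_arn + "</Bucket>";
    xml += "<Format>" + to_xml_value(format) + "</Format>";
    xml += "</S3BucketDestination></Destination>";
    xml += "<IsEnabled>true</IsEnabled>";
    xml += "<Id>" + id + "</Id>";
    xml += "<IncludedObjectVersions>Current</IncludedObjectVersions>";
    xml += "<Schedule><Frequency>" + to_xml_value(frequency) + "</Frequency></Schedule>";
    xml += "</InventoryConfiguration>";
    return xml;
}
```

The id appears both in the query string and in the body, and the two must match for the request to be accepted.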

The API is divided into high level (vendor agnostic router) and low level (vendor specific) objects. The low level APIs, of which only AWS is implemented here, will hide the vendor specific actions required to perform tasks such as:

  • creating an inventory config (implemented in this PR)
  • ascertaining if the latest report is ready
  • downloading the latest report
  • cleaning up old reports

The high level API holds a variant over the low level API implementations and visits the matching low level method depending on which vendor redpanda is deployed on. The variant will be extended to Azure and GCP as these are implemented.
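The variant-and-visit arrangement described above can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions: aws_ops, placeholder_ops and inv_ops_sketch are hypothetical names, the methods return strings instead of issuing HTTP requests, and the real inv_ops in this PR may look quite different.

```cpp
#include <string>
#include <utility>
#include <variant>

// Hypothetical low level (vendor specific) implementation for AWS.
struct aws_ops {
    std::string create_inventory_configuration(const std::string& bucket) const {
        return "PutBucketInventoryConfiguration on " + bucket;
    }
};

// Stand-in for a vendor that is not implemented yet (for example Azure or GCP).
struct placeholder_ops {
    std::string create_inventory_configuration(const std::string&) const {
        return "not implemented";
    }
};

// Hypothetical high level, vendor agnostic router.
class inv_ops_sketch {
public:
    using ops_variant = std::variant<aws_ops, placeholder_ops>;

    explicit inv_ops_sketch(ops_variant ops)
      : _ops(std::move(ops)) {}

    // Dispatches to whichever vendor implementation the variant currently holds.
    std::string create_inventory_configuration(const std::string& bucket) const {
        return std::visit(
          [&bucket](const auto& ops) {
              return ops.create_inventory_configuration(bucket);
          },
          _ops);
    }

private:
    ops_variant _ops;
};
```

With this shape, supporting a new vendor means adding one alternative to the variant and implementing the same method set on it; callers of the router do not change.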

Why not add methods directly to cloud_storage::remote instead of adding a new abstraction layer? remote can already issue the right vendor HTTP calls through its specific clients, so the create-inventory call could potentially be added there without much overhead.

However, remote generally deals with single, retryable HTTP calls (except for a couple of bulk operation cases), and it is built around segment/manifest operations. Only a couple of general purpose methods are exposed there.

While the current inventory config creation maps to a single PUT call, the methods planned for future PRs, such as finding the latest report or checking whether it is ready, will span multiple HTTP calls. For example, downloading a report means fetching a JSON manifest, parsing the paths in it and then downloading those paths. These methods generally require logic that is specific to the inventory API.

It is easier to add a high level API to manage these operations than to push these down into the s3 client or the abs client.

Also, the GCS and AWS APIs are similar enough for cloud storage operations that the same s3 client serves both, but the two vendors differ in their inventory management calls. If we pushed these details down into the s3 client, it would need to be aware of the backend and call different implementations for GCS and AWS.

Backports Required

  • [x] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [ ] v23.3.x
  • [ ] v23.2.x
  • [ ] v23.1.x

Release Notes

  • none

abhijat avatar Feb 14 '24 14:02 abhijat

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45040#018da85d-63df-4910-bef6-61b91aa6c062

vbotbuildovich avatar Feb 14 '24 17:02 vbotbuildovich