[Epic] Expose Storage Utilization Data
## Initiative and Theme
Materialize is Dependable; Materialize makes observability easy
## Problem
There is currently very little visibility into storage utilization, which affects the following use cases:
- It's a poor user experience to charge customers for what we write to S3 without any explanation. We can probably get these costs in bulk from AWS, but we may not be able to break them down in any meaningful way.
- For customer-facing observability, users want to see how much storage they are using for their own diagnosis.
## Success Criteria
There is a design document detailing the changes required to storage to support metering storage usage at the level required by the cloud team, and we implement these changes.
## Time Horizon
3 weeks
## Blockers
None
## Blocks
- https://github.com/MaterializeInc/cloud/issues/3200
- https://github.com/MaterializeInc/cloud/issues/3259
Create a new persist object that writes, tags, and manages the full lifecycle of all files, and then exposes this information.
I don't have my head around this! What level of granularity are we targeting? S3 usage by source? Or something even more granular?
@benesch I purposefully left it vague because I want to see what we need from the cloud billing epic.
Oh, hah, I was going to say that it seemed quite specific in a direction that I did not expect!
If the overarching goal here is to monitor per-source storage usage, there may be ways to do that without any changes to persist. For example, maybe S3 Storage Lens can provide the visibility that we need: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage_lens_basics_metrics_recommendations.html#storage_lens_basics_metrics_types
That would be nifty!
Ok, sorry, so then I'm confused about the phrasing of this epic! Is the idea that it is a placeholder work item for designing whatever changes may be required in storage to support the level of billing granularity settled on in #3200? If so, may I propose removing the references to persist? Perhaps something like:
> [Epic] Integrate storage with cloud billing metering
>
> Success criteria: a design document detailing the changes required to storage to support metering storage usage at the level required by the cloud billing system.
I think you meant to tag https://github.com/MaterializeInc/cloud/issues/3200? And that sounds good, I'll make the updates now.
Oops, I did, thanks!
Snowflake is instructive here. They have an ACCOUNT_USAGE schema which contains several different tables containing usage information for the account: https://docs.snowflake.com/en/sql-reference/account-usage.html. They also have a TABLE_STORAGE_USAGE view which is documented like so:
> This view displays table-level storage utilization information, which is used to calculate the storage billing for each table in the account, including tables that have been dropped, but are still incurring storage costs.
The TABLE_STORAGE_USAGE view is documented to update every hour or two, which gives us some insight into how their internal systems work.
This blog post from Snowflake about Storage Profiling was helpful in surfacing some of these views.
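For a concrete sense of what this enables, here's a minimal sketch of a per-table storage query on the Snowflake side. It uses the column names Snowflake documents for the ACCOUNT_USAGE TABLE_STORAGE_METRICS view; treat the exact view and column names as assumptions rather than something we've verified.

```sql
-- Sketch: per-table storage in a Snowflake account, including dropped
-- tables that still incur storage costs. View and column names are
-- assumptions based on Snowflake's documented ACCOUNT_USAGE schema.
SELECT
    table_catalog,
    table_schema,
    table_name,
    deleted,  -- dropped tables may still be billed
    active_bytes + time_travel_bytes + failsafe_bytes AS total_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY total_bytes DESC;
```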
Ultimately I think we should build towards something similar: an mz_storage_usage table that breaks down bytes used per storage collection. But I think we can aim for something simpler to solve Materialize Cloud's billing needs in the short term.
Put another way, I think this blocks https://github.com/MaterializeInc/cloud/issues/3259 but not https://github.com/MaterializeInc/cloud/issues/3200!
Thinking through what we need for cloud billing (P1) and observability (P3). @nmeagan11, how does this sound?
- P1: customer account-level storage usage snapshot (in bytes) which we can use to bill customers. Every 24 hours is probably workable, but something closer to hourly would be great. An alternative would be to capture this information more often (say, hourly) but only store a daily average (see the sketch after this list).
- P1: working with the Cloud team to surface this information to Orb for billing.
- P3: storage object-level usage snapshot (in bytes; see below). For every object which incurs storage (source, sink, table), we record the object name, object type (source, sink, table), timestamp, and bytes used.
- P3: working with @jpepin and other teams to figure out how to put this information into system tables.
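For the hourly-capture/daily-average alternative, a minimal sketch of the rollup, assuming the hypothetical mz_storage_usage schema proposed below:

```sql
-- Sketch: collapse hourly snapshots into one daily average per object.
-- mz_storage_usage and its columns are hypothetical at this point.
SELECT
    storage_object_name,
    date_trunc('day', "timestamp") AS day,
    avg(storage_owned)::bigint AS avg_bytes
FROM mz_storage_usage
GROUP BY storage_object_name, date_trunc('day', "timestamp");
```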
I think we should shoot for providing this information in a table, mz_storage_usage, along these lines:
| storage_object_name | timestamp | storage_owned (bytes) |
|---|---|---|
| table_sales_calls | 2020-12-21 05:04:34 | 349873000 |
| my-kafka-source-1 | 2020-12-21 05:04:34 | 8769873000 |
Plus maybe some columns which could be used to join the information with mz_sources, mz_tables, etc.
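As a sketch of the kind of join those columns would enable (mz_storage_usage and its columns are hypothetical; mz_sources is the existing catalog table, and joining on name is just a stand-in for a proper id column):

```sql
-- Sketch: attribute usage snapshots to known sources.
-- Joining on name is fragile; a shared id column would be more robust.
SELECT
    s.id AS source_id,
    u.storage_object_name,
    u."timestamp",
    u.storage_owned AS bytes_used
FROM mz_storage_usage AS u
JOIN mz_sources AS s ON s.name = u.storage_object_name
ORDER BY u."timestamp" DESC;
```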
Ideally we can store information for 60 days (that way, the user can always see information for an open bill on their account). Happy to discuss if that is onerous/expensive though.
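And a sketch of what a 60-day window would mean for whatever store backs mz_storage_usage, assuming a periodic prune job (again, hypothetical schema):

```sql
-- Sketch: periodic prune to keep only the trailing 60 days of snapshots.
DELETE FROM mz_storage_usage
WHERE "timestamp" < now() - INTERVAL '60 days';
```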
Questions for Eng:
- Do we store information about customers' databases (the database name, schemas, views, roles) in tables beyond the system tables?
- Can we separate out storage the customer intended (the sources/tables/sinks they ingested/set up) from the information we collect in system tables? I want to know if we can avoid charging customers for the latter.
> surface this information to Orb

Is there a specific format in which we need to provide this data to Orb?
> Ideally we can store information for 60 days
Is 60 days arbitrary, or based on some compliance requirement or industry norm?
> avoid charging customers for that
Do other products charge for metadata? I did a quick search but didn't find anything definitive.
For Orb we essentially create an "event" with the billable metric. I think your team can put the information in any number of places including a system table, but I'll check in with Eng on this.
60 days allows users to look at the numbers that went into their most recent (open, unpaid, not yet due) bill. We don't need to store every metric this way, but I think allowing folks to correlate their usage with top-level bill line items is essential.
WRT metadata, it's unclear to me too. Snowflake has system tables, but it's not clear from the breakdowns whether customers get charged for storing them. I'll try to get access and verify.
We are punting this out of M2 because we believe the cloud team requires no additional work from us to satisfy the M2 billing epic (cc @benesch @hlburak).
The work to collect and store S3 storage usage is being tracked in MaterializeInc/cloud#1467.
Reopening because this needs tests and QA signoff! @jpepin, it probably makes sense for you to get some time with @philip-stoev and talk about how best to test this.
Issues required for completion of this epic:
- https://github.com/MaterializeInc/cloud/issues/3739
- https://github.com/MaterializeInc/cloud/issues/3737
Related but not necessarily required for billing:
- https://github.com/MaterializeInc/cloud/issues/3726
- https://github.com/MaterializeInc/cloud/issues/3716
Closing, as the spirit of the epic is complete. Open question about what to do with the mz_storage_usage view, but https://github.com/MaterializeInc/materialize/issues/17180 can track that independently of this epic.