
[EPIC] Support built-in table migrations for platform

jkosh44 opened this issue 3 years ago

Feature request

Problem Statement

Active replication relies on the schema of the object behind a specific global ID never changing. (TODO: Expand on how it relies on this.)

We frequently have to edit system catalog tables to add/remove/edit columns. However, system catalog tables currently have static global IDs. The global IDs are included in the object definition, for example: https://github.com/MaterializeInc/materialize/blob/e9a3c3b5f8d5283e2a0e82bb3cf55fa90aeccb4a/src/coord/src/catalog/builtin.rs#L1066-L1077

With a single-binary version of Materialize this is OK, because everything restarts together and the new definition is picked up on restart. However, with the platform architecture this doesn't work, because multiple processes restart independently. Therefore, when the ADAPTER restarts, it needs to notify the STORAGE and COMPUTE nodes of any schema changes to built-in objects.

Open Questions

  • How does the ADAPTER know which built-in objects have changed after a restart?
  • How does the ADAPTER inform STORAGE/COMPUTE nodes about changes?

Relevant Platform Architecture Background

Adapter

The ADAPTER contains the definition of all built-in objects. When the ADAPTER starts up it does three things:

  1. Populates in-memory catalog state with all built-in objects and user-created objects. Built-in object definitions are compiled into the binary; user objects are stored in a SQLite database.
  2. Sends commands to STORAGE to create a Source for every catalog Table and Log.
  3. Sends commands to STORAGE to populate data in all built-in tables.

The ADAPTER also handles user queries. Depending on the query, the ADAPTER may send a command to STORAGE to create a new source, send a command to COMPUTE to create a new dataflow, or both. For example, CREATE TABLE foo (a int) will send a message to STORAGE to create a new source for foo, and CREATE INDEX ex_ind ON mz_tables (schema_id) will send a message to COMPUTE to create a new dataflow for ex_ind.

Storage

STORAGE nodes contain the definition and contents of Tables and Logs.

Compute

COMPUTE nodes contain dataflows and their outputs. Dataflows can output their results to an Index or a Sink. Dataflows can use Tables from STORAGE nodes or Indexes as inputs.

Notes

This is dependent on #11103 which lays the foundation for handling ADAPTER restarts.

Tasks

  • [x] https://github.com/MaterializeInc/materialize/pull/11741
  • [ ] Migrate introspection sources (#14125)
  • [ ] Convert SELECT * to SELECT a,b,... (list of all columns). This is so downstream views aren't changed by adding a column to a table.
  • [ ] Check the built-in objects in the Platform Checks framework

jkosh44 avatar Mar 28 '22 15:03 jkosh44

Proposals

Change detection

The following are two separate proposals for detecting whether the schema of a built-in object has changed.

Version Numbers (Proposal 1)

The current System variant of GlobalId looks like

    /// System namespace.
    System(u64),

We could add a version component so it looks like

    /// System namespace.
    System { id: u64, version: u64 },

One of the tables in our underlying SQLite catalog could map each system ID to a version, indicating the current version of that system object. Whenever the ADAPTER restarts, it compares the version number of each built-in object compiled into the binary to the version number stored in SQLite. Any object with a different version number needs to undergo a schema migration and then have its version updated in SQLite.
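A minimal sketch of that comparison step; the `Builtin` struct and `stored_versions` map below are hypothetical names for illustration, not the actual catalog API:

    // Hypothetical sketch of the version-comparison step at ADAPTER startup.
    use std::collections::HashMap;

    struct Builtin {
        id: u64,
        version: u64, // bumped whenever the object's schema changes
    }

    /// Returns the IDs of builtins whose compiled-in version differs from
    /// the version persisted in SQLite; these need a schema migration.
    fn detect_migrations(
        builtins: &[Builtin],
        stored_versions: &HashMap<u64, u64>, // system id -> persisted version
    ) -> Vec<u64> {
        builtins
            .iter()
            .filter(|b| stored_versions.get(&b.id) != Some(&b.version))
            .map(|b| b.id)
            .collect()
    }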

The version number will be opaque to STORAGE and COMPUTE and only be semantically meaningful to the ADAPTER. Put another way: to COMPUTE and STORAGE, two objects with the same ID but different versions are two completely different objects; only the ADAPTER knows that they refer to the same built-in object.

Dynamic System Ids (Proposal 2)

Currently, built-in objects have their IDs compiled into the binary (see the issue description, with mz_tables as an example). We can change this process to dynamically allocate system IDs on startup. We would need two additional tables in our underlying SQLite catalog: one to tell us the next available system ID (this is very similar to the existing gid_alloc table and could even be combined with it), and one mapping built-in object name to system ID. The steps at startup would be the following (a sketch follows the list):

  1. Read in the previous name-to-ID mapping from SQLite, if one exists (if it doesn't, skip to step 4).
  2. Allocate new IDs to all built-in objects.
  3. Perform a schema migration FOR ALL built-in objects (regardless of whether they changed).
  4. Update the name-to-ID mapping in SQLite.
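A minimal sketch of these four steps; `SqliteCatalog` and `migrate` are hypothetical helpers, not the real stash API:

    // Hypothetical sketch of dynamic system-ID allocation at startup.
    use std::collections::HashMap;

    struct SqliteCatalog {
        next_system_id: u64,              // like the existing gid_alloc table
        name_to_id: HashMap<String, u64>, // builtin name -> system id
    }

    impl SqliteCatalog {
        fn allocate_system_id(&mut self) -> u64 {
            let id = self.next_system_id;
            self.next_system_id += 1;
            id
        }
    }

    /// Placeholder for moving an object from its old ID to its new ID.
    fn migrate(_old_id: u64, _new_id: u64) {}

    fn bootstrap_builtins(catalog: &mut SqliteCatalog, builtin_names: &[&str]) {
        // Step 1: read the previous mapping (empty on a fresh install).
        let previous = catalog.name_to_id.clone();
        // Step 2: allocate new IDs for every built-in object.
        let mut new_mapping = HashMap::new();
        for name in builtin_names {
            new_mapping.insert(name.to_string(), catalog.allocate_system_id());
        }
        // Step 3: migrate every builtin (regardless of whether it changed).
        for name in builtin_names {
            if let Some(old_id) = previous.get(*name) {
                migrate(*old_id, new_mapping[*name]);
            }
        }
        // Step 4: persist the new name-to-ID mapping.
        catalog.name_to_id = new_mapping;
    }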

We can leave actually detecting which objects have changed and only migrating those as a future optimization.

Schema Migration

The following describes how to change the schema for a specific object. Schema here refers to the columns and column types of an object, not the schema namespace. It assumes that the ADAPTER knows which objects have changed, which is discussed in another section, and it leaves the discussion of IDs for another section as well. The process should be done before the ADAPTER starts accepting connections and therefore doesn't need to be done transactionally. The following proposal does not work for removing a column or changing the type of a column UNLESS nothing depends on that column. In general, for system tables I don't think it's a good idea to remove columns or change their data types, because it's unclear what to do with downstream dependencies, though one option is to simply delete the downstream dependencies.

The use of "recursively" below is a bit hand-wavy. It just means: before dropping an object, drop all objects that depend on it; before dropping those objects, drop all objects that depend on them; and so on. The reverse applies when recreating. Additionally, when I say "drop dataflows", what I really mean is allow compaction to [] and drop their outputs. A sketch of the process for tables follows the first list below.

Tables

  1. Recursively drop all dataflows that use this table as a source.
  2. Recursively drop all views over the table.
  3. Drop table.
  4. Create table with new schema.
  5. Recursively recreate all views using new table.
  6. Recursively recreate all dataflows using new table.
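A minimal sketch of this drop/recreate order, assuming a hypothetical `Catalog` API; the real process also involves commands to STORAGE and COMPUTE:

    // Hypothetical sketch of the recursive drop/recreate order for a table
    // migration. `Catalog` and its methods are illustrative names only.
    struct Catalog;

    impl Catalog {
        /// All objects that transitively depend on `id`, ordered so that an
        /// object always appears before anything it depends on.
        fn transitive_dependents(&self, _id: u64) -> Vec<u64> {
            unimplemented!()
        }
        fn drop_object(&mut self, _id: u64) {}
        fn create_object(&mut self, _id: u64) {}
        fn create_table(&mut self, _id: u64, _new_schema: &str) {}
    }

    fn migrate_table(catalog: &mut Catalog, table_id: u64, new_schema: &str) {
        let dependents = catalog.transitive_dependents(table_id);
        // Steps 1-2: drop dataflows and views, most-dependent first.
        for &id in &dependents {
            catalog.drop_object(id);
        }
        // Steps 3-4: drop the table and recreate it with the new schema.
        catalog.drop_object(table_id);
        catalog.create_table(table_id, new_schema);
        // Steps 5-6: recreate views and dataflows in reverse order.
        for &id in dependents.iter().rev() {
            catalog.create_object(id);
        }
    }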

The current Command Reconciliation approach (#11103) truncates and re-fills all catalog tables, so we don't have to worry about re-populating this new table. We just have to make sure this schema migration happens before the truncations.

Views

  1. Recursively drop all dataflows that use this view as a source.
  2. Recursively drop all views over the view.
  3. Drop view.
  4. Create view with new schema.
  5. Recursively recreate all views using new view.
  6. Recursively recreate all dataflows using new view.

Log/Introspection Source

  1. Recursively drop all dataflows that use this log as a source.
  2. Recursively drop all views over the log.
  3. Drop log.
  4. Create log with new schema.
  5. Recursively recreate all views using new log.
  6. Recursively recreate all dataflows using new log.

Types

These don't have schemas so no work needs to be done.

Functions

These don't have schemas so no work needs to be done.

jkosh44 avatar Mar 29 '22 19:03 jkosh44

One issue with the Dynamic System Ids proposal is that it relies on all built-in objects having unique names. This isn't necessarily true: all built-in logs are replicated to every compute instance with identical names but different global IDs. I'm not quite sure how we'd solve this right now; maybe the log names should have the compute instance ID appended.

jkosh44 avatar Mar 30 '22 17:03 jkosh44

From offline discussion it seems like Dynamic System Ids is more popular than Version Numbers. There are two implementation details/caveats with this approach that I wanted to express:

  • BuiltinType definitions reference other BuiltinType definitions by their IDs. This means that we either can't dynamically assign IDs to BuiltinTypes OR we have to be very smart about it. This is probably fine because BuiltinTypes don't have schema migrations and can keep static IDs.
  • Currently, built-in names aren't always unique. Every compute instance has a copy of each BuiltinLog index with the exact same name but a different system ID. This is also probably fine: indexes will never have schema migrations; only the logs that they index will migrate. If a log migrates, its indexes will be torn down and recreated as part of the migration process. Therefore I think we can ignore indexes in the name-to-ID mapping.

jkosh44 avatar Mar 31 '22 13:03 jkosh44

  • BuiltinType definitions reference other BuiltinType definitions by their IDs. This means that we either can't dynamically assign IDs to BuiltinTypes OR we have to be very smart about it. This is probably fine because BuiltinTypes don't have schema migrations and can keep static IDs.

Chatted on Slack and we can probably fix this by breaking the cycle with array types. Only the array types really need to reference the element type; element types can find their corresponding array type (if any) on the fly.
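A minimal sketch of that idea; the `BuiltinType` fields here are illustrative, not the actual definitions. Only array types carry a reference to their element type, and the reverse lookup is computed on demand:

    // Hypothetical sketch: break the cycle by storing only the
    // array -> element reference, and finding an element's array type
    // by scanning.
    struct BuiltinType {
        id: u64,
        name: &'static str,
        element_id: Option<u64>, // set only for array types
    }

    /// Find the array type (if any) whose element type is `element_id`.
    fn array_type_for(types: &[BuiltinType], element_id: u64) -> Option<&BuiltinType> {
        types.iter().find(|t| t.element_id == Some(element_id))
    }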

  • Currently, built-in names aren't always unique. Every compute instance has a copy of each BuiltinLog index with the exact same name but a different system ID. This is also probably fine: indexes will never have schema migrations; only the logs that they index will migrate. If a log migrates, its indexes will be torn down and recreated as part of the migration process. Therefore I think we can ignore indexes in the name-to-ID mapping.

Oops! This seems like a bug. Probably want to change that logic to generate a unique name like LOGNAME-CLUSTERNAME-primary-idx, and append 1, 2, etc. until we get an actually unique name.
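A minimal sketch of that name-generation logic; the `existing` set stands in for the catalog's name lookup:

    // Hypothetical sketch of unique index-name generation: start from
    // LOGNAME-CLUSTERNAME-primary-idx and append 1, 2, ... until unused.
    use std::collections::HashSet;

    fn unique_index_name(log: &str, cluster: &str, existing: &HashSet<String>) -> String {
        let base = format!("{log}-{cluster}-primary-idx");
        if !existing.contains(&base) {
            return base;
        }
        let mut n = 1;
        loop {
            let candidate = format!("{base}{n}");
            if !existing.contains(&candidate) {
                return candidate;
            }
            n += 1;
        }
    }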

benesch avatar Apr 01 '22 03:04 benesch

Probably want to change that logic to generate a unique name like LOGNAME-CLUSTERNAME-primary-idx, and append 1, 2, etc. until we get an actually unique name.

Just to confirm: LOGNAME-CLUSTERNAME-primary-idx on its own isn't good enough, in case a user happens to have an index with the same name?

jkosh44 avatar Apr 01 '22 18:04 jkosh44

Yeah exactly!

benesch avatar Apr 01 '22 18:04 benesch

Just a random thought I had: to detect schema changes in the Dynamic System Ids approach, we can store a hash of each built-in object in the underlying SQLite mapping table. Then on startup we can compare the hash of every built-in with its stored hash; if they differ, we know it needs to undergo a schema change.
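A minimal sketch of that check, with illustrative names (what exactly gets hashed is up to the catalog):

    // Hypothetical sketch of hash-based change detection at startup.
    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashMap;
    use std::hash::{Hash, Hasher};

    fn fingerprint(definition: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        definition.hash(&mut hasher);
        hasher.finish()
    }

    /// Returns the names of builtins whose current definition no longer
    /// matches the fingerprint persisted at the previous boot.
    fn changed_builtins(
        builtins: &HashMap<String, String>, // name -> current definition
        stored: &HashMap<String, u64>,      // name -> persisted fingerprint
    ) -> Vec<String> {
        let mut changed = Vec::new();
        for (name, definition) in builtins {
            if stored.get(name) != Some(&fingerprint(definition)) {
                changed.push(name.clone());
            }
        }
        changed
    }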

jkosh44 avatar Apr 05 '22 20:04 jkosh44

I like the sound of that, @jkosh44!

benesch avatar Apr 05 '22 20:04 benesch

I have a PR (#11654), extending #11643, that adds the functionality to detect schema changes. It doesn't actually perform any schema migrations, but it adds the plumbing/metadata necessary for detecting them.

After #11643 is merged is it worth it to merge #11654 or should we wait until we actually implement the schema migration functionality? @benesch any thoughts?

jkosh44 avatar Apr 07 '22 20:04 jkosh44

Just wanted to write a quick update:

  • #11536 implemented dynamically allocated system IDs for all builtins except for types
  • #11643 implemented dynamically allocated system IDs for builtin types
  • #11654 implemented the ability to detect when the schema of a builtin object changes.

So all that's left is to actually implement the schema migration. This is now somewhat blocked by command reconciliation, in that it doesn't really make sense to implement the ADAPTER restarting successfully when a built-in schema changes before we implement the ADAPTER restarting successfully when nothing changes.

jkosh44 avatar Apr 08 '22 16:04 jkosh44

#11741 implements the builtin migrations. It's currently impossible to test with platform until command reconciliation is complete. However, it doesn't seem to break anything, and from some manual testing it works for migrations in the single-binary world.

jkosh44 avatar Apr 15 '22 16:04 jkosh44

No, this has been blocked for a long time. All the spam on this issue that you see has been me rebasing the PR that would fix this issue, though.

jkosh44 avatar Jul 05 '22 16:07 jkosh44

The order of creating and deleting migrated objects is not always correct in the current implementation. However, for simple migrations it should still work. The following properties must hold for the creation and deletion order:

  • Objects must be deleted in the dependency order from before the migration.
  • Objects must be created in the dependency order from after the migration.

The current orders are incorrect for two reasons:

  • We generate the order using BFS starting from the nodes that have new definitions. BFS is not guaranteed to produce a topological order.
  • The dependencies between objects can change between migrations.

The following PR fixes the creation order but not the deletion order: https://github.com/MaterializeInc/materialize/pull/14369
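For reference, a minimal sketch of deriving a correct order with a DFS-based topological sort instead of BFS; `deps` is a hypothetical map from each object to its direct dependents:

    // Hypothetical sketch: post-order DFS yields an order in which every
    // object appears after all of its dependents. Run it over the
    // pre-migration graph for deletion; run it over the post-migration graph
    // and reverse the result for creation.
    use std::collections::{HashMap, HashSet};

    fn visit(
        id: u64,
        deps: &HashMap<u64, Vec<u64>>,
        visited: &mut HashSet<u64>,
        order: &mut Vec<u64>,
    ) {
        if !visited.insert(id) {
            return;
        }
        for &dependent in deps.get(&id).into_iter().flatten() {
            visit(dependent, deps, visited, order);
        }
        order.push(id); // all dependents of `id` are already in `order`
    }

    fn deletion_order(roots: &[u64], deps: &HashMap<u64, Vec<u64>>) -> Vec<u64> {
        let mut visited = HashSet::new();
        let mut order = Vec::new();
        for &root in roots {
            visit(root, deps, &mut visited, &mut order);
        }
        order
    }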

The following improvement can be made to fingerprint generation for builtin objects: currently we hash the entire object to determine whether it has been migrated, but all we really need to hash is the number, name, and type of each column.

@frankmcsherry proposed a better approach to builtin migration that deviates less from the current bootstrapping (a sketch follows the list):

  1. Go through the normal bootstrapping process, where all objects are read from the stash and the list of builtins to be created.
  2. If (the object is a builtin AND its fingerprint doesn't match the persisted builtin) OR (the object depends on a marked object), then:
     a. Mark the object and all of its dependencies as migrated.
     b. Give the object a new global ID when creating it.
  3. At the end of bootstrapping, delete all the objects that were marked as migrated.
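A minimal sketch of that approach, with illustrative names; objects are assumed to be processed in dependency order during bootstrap:

    // Hypothetical sketch of mark-during-bootstrap migration detection.
    use std::collections::{HashMap, HashSet};

    struct Object {
        id: u64,
        name: String,
        is_builtin: bool,
        fingerprint: u64,
        depends_on: Vec<u64>,
    }

    /// Walk objects in dependency order; mark each one whose builtin
    /// fingerprint changed or that depends on an already-marked object.
    /// Marked objects get new global IDs when created; the old objects are
    /// deleted at the end of bootstrapping.
    fn mark_migrations(
        objects: &[Object],
        persisted: &HashMap<String, u64>, // builtin name -> stored fingerprint
    ) -> HashSet<u64> {
        let mut marked = HashSet::new();
        for obj in objects {
            let fingerprint_changed = obj.is_builtin
                && persisted.get(&obj.name) != Some(&obj.fingerprint);
            let depends_on_marked =
                obj.depends_on.iter().any(|dep| marked.contains(dep));
            if fingerprint_changed || depends_on_marked {
                marked.insert(obj.id);
            }
        }
        marked
    }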

jkosh44 avatar Aug 29 '22 16:08 jkosh44

Downgrading from epic. Currently tracked in the linked issue above.

ggnall avatar Aug 31 '22 17:08 ggnall

@jkosh44, following up on this ticket: it seems like the majority of the work has been completed. The outstanding items from the original ticket are:

  1. Migrate storage collections
  2. Convert SELECT * to SELECT a,b,... (list of all columns). This is so downstream views aren't changed by adding a column to a table.
  3. Check the built-in objects in the Platform Checks framework
  4. Figure out how to fix Sinks.

If you can, I have a couple of questions about these work items, mostly to determine whether they're still relevant.

  1. Migrate storage collections

What are these? Is this still applicable?

  2. Convert SELECT * to SELECT a,b,...

AFAIK this is more complicated than we thought it would be, so it's in the backlog?

  3. Check the built-in objects in the Platform Checks framework

What is the "Platform Checks framework"?

  4. Figure out how to fix Sinks.

What about Sinks were broken, are they still broken?

Given how old this issue is, I'm thinking we should close this out, and describe any remaining work in a new ticket?

  1. Migrate storage collections

What are these? Is this still applicable?

I'm not sure what we're currently calling them; they're similar to introspection sources, but for STORAGE. They're catalog tables that STORAGE writes to directly, without going through the Coordinator. I also don't know whether they can be migrated cleanly. Some investigation is needed here.

  2. Convert SELECT * to SELECT a,b,...

AFAIK this is more complicated than we thought it would be, so it's in the backlog?

Yes: https://github.com/MaterializeInc/materialize/issues/16650

  3. Check the built-in objects in the Platform Checks framework

What is the "Platform Checks framework"?

I don't think I added this one, and if I did then I've forgotten all context. The Platform Checks framework is one of the testing frameworks we run in CI. It's what's responsible for the stages that start with "Check...". For example: https://buildkite.com/materialize/tests/builds/53600#01877b4f-2933-4ba7-b03e-9b49fa5c2d2f. The testing team can probably give you more information.

  4. Figure out how to fix Sinks.

What about Sinks were broken, are they still broken?

The current builtin migration framework works by deleting everything that needs to be migrated and then re-creating it. For sinks this means that the sink will re-ingest all of its data and send duplicate rows to the external system it's connected to.

jkosh44 avatar Apr 13 '23 17:04 jkosh44

Fantastic, thanks for the context Joe!