
[dbt] dbt deploy

Open sjwiesman opened this issue 2 years ago • 5 comments

Motivation

Fixes #24186.

We want to integrate the blue/green deployment work into a simple-to-use dbt macro. We introduce a new deployment configuration in dbt_project.yml that is scoped to the current target.

Users specify any number of clusters and schemas to be part of a deployment.

vars:
  deployment:
    default:
      clusters:
        - prod: red
          prod_deploy: yellow
      schemas:
        - prod: blue
          prod_deploy: green

There are then three commands to execute:

dbt run-operation create_deployment_environment

Automatically creates the prod_deploy schemas and clusters. Clusters are created with the same configuration as their corresponding prod target. The command will error if the prod cluster or schema does not exist. It is valid for the prod_deploy schema or cluster to be pre-created. For example, a user might want to deploy on a different sized cluster. If the target cluster or schema does exist, the command will output a warning if it contains any objects.
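With the example config above, the effect is roughly the following (a sketch; in practice the cluster options are copied from the corresponding prod cluster, and the SIZE shown here is a hypothetical placeholder):

```sql
-- Create the deployment environment alongside production.
CREATE CLUSTER yellow (SIZE = 'medium');
CREATE SCHEMA green;
```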

dbt run-operation deploy

This command waits until all clusters are hydrated and have lag less than 1 second, and then executes the alter swap commands in a transaction. It will fail if any of the clusters or schemas do not exist.
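Conceptually, the promotion boils down to an atomic swap. Using the cluster and schema names from the example config above, a sketch of the generated SQL (assuming Materialize's ALTER ... SWAP WITH syntax):

```sql
-- Wait for hydration/lag first, then swap atomically so consumers
-- never observe a half-promoted state.
BEGIN;
ALTER CLUSTER red SWAP WITH yellow;
ALTER SCHEMA blue SWAP WITH green;
COMMIT;
```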

dbt run-operation drop_deployment_environment

Drops all the prod_deploy schemas and clusters for easy cleanup.
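Again with the example names, this amounts to something like:

```sql
-- Tear down the deployment environment and everything in it.
DROP CLUSTER yellow CASCADE;
DROP SCHEMA green CASCADE;
```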

This PR is missing tests because I wasn't sure how to do that and I'm looking for early feedback on the direction.

Checklist

  • [ ] This PR has adequate test coverage / QA involvement has been duly considered.
  • [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [ ] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [ ] This PR includes the following user-facing behavior changes:

sjwiesman avatar Feb 12 '24 21:02 sjwiesman

@morsapaes I believe I've addressed all the review comments. I have run this locally under a few different scenarios and feel pretty good about the correctness. I do need some guidance on how to write tests. I'm not sure how to prestage an environment with clusters and objects, or check the state after dbt runs.

sjwiesman avatar Feb 13 '24 17:02 sjwiesman

@morsapaes happy to be opinionated. What are you thinking? Force sources / sinks to be in a different schema than indexes / mat views? Different clusters even?

sjwiesman avatar Feb 13 '24 20:02 sjwiesman

Regarding tests, there are virtually no dbt-specific developer docs in the repo, but the examples under /misc/dbt-materialize/tests/adapter should be enough for you to riff off. In particular, check out the cluster tests.

You can then use mz_compose to run the tests (see MAINTAINERS).

morsapaes avatar Feb 14 '24 00:02 morsapaes

Tests are in place, I'm marking the PR as ready to review.

sjwiesman avatar Feb 14 '24 22:02 sjwiesman

Thanks for putting this together! This will help a ton.

I think the testing should include a scenario where a source is updated as well. This would include creating a new cluster and schema for the new source and then swapping, which fits in nicely with what this macro already does.

However, there is an additional detail for deploying a new source that needs to be addressed in the test and the eventual documentation beyond swapping clusters and schemas. In order for the new models to reference the new source, you would need a variable that tells the model which schema the new sources are in. You shouldn’t simply use the target schema because it’s possible for many different views in many different schemas to reference the same sources. It makes sense to keep source schema separate.

Here is a schema.yml file that we may use to define a source to be referenced:


sources:
  - name: my_source
    database: '{{ target.database }}'
    schema: '{{ var("source_schema", "public") }}'
    description: counts up
    tables:
      - name: subsource_1
      - name: subsource_2

Then the source can be referenced in the model like this:

…
FROM {{ source('my_source', 'subsource_1') }}
…

Notice the var for source schema. The dbt run command that deploys the new models would then need to supply this source_schema variable so the compiled SQL correctly resolves the schema of the new sources. If the variable is not supplied, then the default would fall back to the second argument of the var function, which in this case is public.
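For example, the deploy run could override the variable on the command line (the schema name here is hypothetical):

```shell
dbt run --vars '{"source_schema": "sources_deploy"}'
```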

chuck-alt-delete avatar Feb 16 '24 16:02 chuck-alt-delete

This looks like a really good approach. Code and tests all look good to me. I am a little worried 1 second of lag will be too tight a threshold in some cases… I might go with 5 or so. Also should we fail on schemas with sinks in them? Creating new sinks rather than keeping them outside the deploy schema seems like a rare thing you would only want to do very purposefully IMO.

I will find time to run some tests with this this week.

josharenberg avatar Feb 18 '24 16:02 josharenberg

Creating new sinks rather than keeping them outside the deploy schema seems like a rare thing you would only want to do very purposefully IMO.

Yeah, we should fail the operation if there are sources or sinks in the cluster or schema. This is the biggest footgun users run into as is, running the sequence of commands by hand. The downside here is that we don't support ALTER SOURCE/SINK...SET SCHEMA (#22615) or ALTER SOURCE/SINK...IN CLUSTER (#17417), so it might be a little late to enforce namespacing conventions.

Thinking out loud: we could maybe use a prompt (something like this?) to warn users about a potential destructive action, and leave it up to them whether they want to proceed or not.

morsapaes avatar Feb 19 '24 00:02 morsapaes

Nice! Really excited to see this coming together. Assorted thoughts incoming.


Thought 1. Allowing users to choose the name of the prod and prod_deploy clusters seems like an unnecessary degree of freedom! Also I'm not personally sold on those names being maximally clear, though they're certainly better than blue and green, and I know the FE team has been having some success using those names with customers.

I think we can sidestep the whole conversation by allowing dbt-materialize to be opinionated. Why not force the _deploy naming convention? Or, to avoid collisions, go with something a little more unique like _dbt_deploy.

Then users would need only specify the names of the clusters and schemas they want to deploy into...

vars:
  deployment:
    default:
      clusters: [usecase1, usecase2, usecase3]
      schemas: [usecase1_subpart1, usecase1_subpart2, usecase2, usecase3]

...and the adapter would automatically deploy into usecase1_dbt_deploy, usecase2_dbt_deploy, etc.


Thought 2. Would be nice to have a consistent prefix for the macro names. As it stands it's clear that create_deployment_environment and drop_deployment_environment are related, but not blindingly obvious that deploy is related.

Here's a different slicing of the macros that I think is worth exploring:

  • deploy_init — identical to create_deployment_environment today
  • <user runs dbt run>
  • deploy_wait — runs macro to wait for all clusters to be hydrated
  • <user runs their own validation>
  • deploy_promote — swaps all clusters and schemas in a transaction
  • deploy_promote(wait=True) — combines deploy_wait and deploy_promote into a single step, for users who don't want to run their own validation
  • deploy_cleanup — identical to drop_deployment_environment today

And a jumble of justifications for the above:

  • I think deploy_cleanup is a little less scary than drop_deployment_environment; it just reads a little less to me like that might accidentally drop your production deployment. The flip side is that "create" and "drop" are nicely SQL-y. If we just want to introduce a consistent prefix but otherwise keep the existing verbs, could go with deploy_create and deploy_drop.
  • I like separating out deploy_wait and deploy_promote because it implies that you should be doing additional application-specific validation between waiting and promoting. The deploy_promote(wait=True) is still available for users who want to YOLO a bit more.
  • I also like not calling anything a bare deploy because it avoids blessing any one step as the "deploy" step. Like, the whole thing is the deploy, and you need to do all the steps to complete your deploy.

Thought 3. It's a little scary that deployments aren't idempotent. Like, running deploy_promote multiple times ideally wouldn't repeatedly swap the schemas around. I wonder if we should somehow include a marker object in the schemas so we can tell whether the deployment was already swapped or not. Probably doesn't matter much in practice though.


Thought 4. Handling of sources and sinks.

First of all, strong agree that we should probably warn or fail if you have a sink in any of the schemas. Sinks are different and hard, and almost surely you want to be handling them specially, outside of this macro.

However, I think sources are much trickier! If you're managing sources in your dbt project, having a source in your schema is totally well and fine—you'll just recreate it on every deploy. For use cases that can tolerate this, this honestly seems desirable!

The problem with sources occurs when you manage them outside of dbt (e.g., by hand or with Terraform) but in the same schema that you're trying to blue/green with dbt. I wonder if we can detect this situation in the adapter, and only complain if you're trying to blue/green a schema with a source that's not managed by dbt. But I'm not sure this is really a source-specific concern! We should probably detect and warn if there are any objects in the schema that aren't managed by dbt.

However, there is an additional detail for deploying a new source that needs to be addressed in the test and the eventual documentation beyond swapping clusters and schemas. In order for the new models to reference the new source, you would need a variable that tells the model which schema the new sources are in. You shouldn’t simply use the target schema because it’s possible for many different views in many different schemas to reference the same sources. It makes sense to keep source schema separate.

I want to point out that this isn't just a source-specific problem! This is a general problem that occurs whenever you have daisy chained use cases. If I have a materialized view in use case 1 named mv1, and a downstream materialized view in use case 2 like CREATE MATERIALIZED VIEW usecase2.mv2 AS SELECT * FROM usecase1.mv1, how do I blue/green this? Unless I'm missing something, this doesn't work out with the macro today, regardless of whether you try to blue/green use case 1 and use case 2 together or separately. If you do them together, (I think) usecase2.mv2 won't know to depend on usecase1_deploy.mv1, and if you do them separately, running deploy_cleanup for usecase1 will drop usecase2.mv2 (eek!).

We definitely don't need to solve this all in our first cut at this, but I think it would behoove us to aim at solving the general problem of dependencies chained across use cases, rather than focusing in on the one specific instance of this problem presented by sources.


Thought 5. Sources in dbt, generally.

Unrelated to this PR, but I think we should seriously consider adjusting our handling of source models in dbt to automatically inject the CREATE SOURCE <name> block. That would fix one of the biggest papercuts with sources, which is that they don't respect the usual configuration for database/schema names.

-- source.sql

FROM KAFKA CONNECTION kc (TOPIC 'foo')
FORMAT JSON
ENVELOPE NONE
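For illustration, the adapter might then compile that model into something like the following (a sketch; the rendered identifiers come from the usual database/schema/alias config):

```sql
CREATE SOURCE "materialize"."public"."source"
FROM KAFKA CONNECTION kc (TOPIC 'foo')
FORMAT JSON
ENVELOPE NONE;
```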

benesch avatar Feb 19 '24 22:02 benesch

Thanks for all the feedback everyone!

@josharenberg what do you think about making the lag interval configurable but default to 1s?

@benesch regarding Thought 1 I have a few practical questions. If we generate the names of the deployment schemas / clusters automatically, does that mean users will need to fill in _dbt_deploy for all their model configuration when specifying schemas and clusters? I like injecting the names in theory but I'm concerned it'll be confusing if people need to manually fill in names that match what we generate. I took a very cursory look at injecting the suffix on the fly but I couldn't come up with anything I liked.

Thought 2 / 3 I like it! When we swap a schema out of production we can add a table that has metadata about the deployment. I can think of some clean messages we can wire up with that.

Thought 4 I have a lot of thoughts here but am going to sleep so consider this a teaser for tomorrow.

sjwiesman avatar Feb 20 '24 03:02 sjwiesman

Little wrinkle to using a table with Thought 3 is we can't query it off mz_introspection, so we'll need a healthy user cluster. One option could be to create any object (maybe a view) and store metadata as a comment.

But getting back to Thought 4. There's a lot we can do here and I'd personally like to mostly punt on the problem in this pr. It's a big problem space and I think we need to see users in action to come up with a cogent solution.

There are a few things we can do now. dbt has the manifest file we can use to see which objects are created as part of a single project. We could parse that and then diff it with what's deployed in the schema. We can also put a guardrail in deploy_cleanup that uses mz_object_dependencies to validate that nothing we're about to drop is being depended on by something managed outside the dbt project. For example, if you have a materialized view that is referenced by another team on another cluster in a different schema, we can prevent you from taking that other project offline.
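A rough sketch of such a guardrail query (the mz_internal relation and column names are assumed; 'green' is the deploy schema from the earlier example):

```sql
-- Hypothetical guardrail for cleanup: list objects outside the deploy
-- schema that depend on objects inside it, before dropping anything.
SELECT dependent.name AS dependent_object
FROM mz_internal.mz_object_dependencies dep
JOIN mz_objects dropped ON dep.referenced_object_id = dropped.id
JOIN mz_objects dependent ON dep.object_id = dependent.id
WHERE dropped.schema_id IN (SELECT id FROM mz_schemas WHERE name = 'green')
  AND dependent.schema_id NOT IN (SELECT id FROM mz_schemas WHERE name = 'green');
```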

sjwiesman avatar Feb 20 '24 15:02 sjwiesman

Little wrinkle to using a table with Thought 3 is we can't query it off mz_introspection, so we'll need a healthy user cluster. One option could be to create any object (maybe a view) and store metadata as a comment.

I like that option! Now that I think about it, I think you could even comment on the schema itself, to avoid introducing another object to keep track of.
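A sketch, using the example schema name (assuming Materialize's COMMENT ON support):

```sql
-- Mark the deploy schema so the promote step can detect a prior swap.
COMMENT ON SCHEMA green IS 'dbt blue/green deploy environment';
```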

benesch avatar Feb 20 '24 17:02 benesch

But getting back to Thought 4. There's a lot we can do here and I'd personally like to mostly punt on the problem in this pr. It's a big problem space and I think we need to see users in action to come up with a cogent solution.

I think that's the right call!

There are a few things we can do now. dbt has the manifest file we can use to see which objects are created as part of a single project. We could parse that and then diff it with what's deployed in the schema. We can also put a guardrail in deploy_cleanup that uses mz_object_dependencies to validate that nothing we're about to drop is being depended on by something managed outside the dbt project. For example, if you have a materialized view that is referenced by another team on another cluster in a different schema, we can prevent you from taking that other project offline.

Both of these things sound great. They also both sound like things that could potentially be added in fast follow PRs, to keep this PR more manageable.

benesch avatar Feb 20 '24 17:02 benesch

I have updated the macros based on our previous discussion to autogenerate schemas and clusters and intercept the value specified by users when running dbt run --vars 'deploy: True'.
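Putting it together, the end-to-end workflow would look roughly like this (macro names as proposed earlier in the thread; a sketch, not final CLI docs):

```shell
dbt run-operation deploy_init      # create the *_dbt_deploy clusters and schemas
dbt run --vars 'deploy: True'      # build models into the deployment environment
dbt run-operation deploy_promote   # wait for hydration, then swap into production
dbt run-operation deploy_cleanup   # drop the old environment
```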

I've opened up issues for the other items we've discussed. I'm happy to implement them as fast follows, even before we cut the next release, but I'd like to keep the scope of this PR where it is.

Please take a look when you have time, but I'd love to get this merged in quickly.

sjwiesman avatar Feb 21 '24 17:02 sjwiesman

Gorgeous, thanks @sjwiesman! I'm not going to have time to do a thorough review of the code itself but +1 from me on not adding any additional scope to this PR.

Quick favor to ask @sjwiesman: can you update the PR description for the new behavior?

benesch avatar Feb 21 '24 18:02 benesch

Quick favor to ask @sjwiesman: can you update the PR description for the new behavior?

done

sjwiesman avatar Feb 21 '24 18:02 sjwiesman

Amazing! The UX described looks great. Thanks so much for putting this one together, Seth!

benesch avatar Feb 21 '24 18:02 benesch

Opening up #25405 for discussion as a follow-up that will get this working also for sources. I suggest we cut a release with just the B/G deployment workflow, but cut another one with this breaking change and any adjustments needed soon after. Let's not release this to users before the documentation is updated, though (I'm working on it and will push a draft PR tomorrow).

morsapaes avatar Feb 22 '24 09:02 morsapaes