create Docker builds and add docker-compose config for self-hosting
By popular demand, I bring you... Docker builds! This PR adds support for building/starting the entire project with a single command and provides a base docker-compose.yml that can be used as a reference for how to self-host Stack using Docker builds instead of Vercel.
TL;DR
To test this out, you can clone my fork and just run the new Docker startup command:

```sh
git clone --single-branch -b docker-builds https://github.com/jshimko/stack.git stack-docker
cd stack-docker
pnpm docker:up

# which is just a convenient alias for...
# docker compose up -d && docker compose logs -f dashboard backend
```
When running this for the first time, Docker Compose will build the dashboard and backend Docker images and then start them both. It then tails the logs of both containers so you can watch for issues and follow request logs. Once the containers are up, you should be able to open the dashboard at http://localhost:8101 and sign in. That's it! The database is already migrated, and a new "self-host" seed script has been run inside the container to ensure the required starting data exists.
Details
The first time you run pnpm docker:up, it will build the Docker images, but if you change code and want a new build, you have to explicitly run pnpm docker:build and then pnpm docker:up again. There's also a pnpm docker:reset command that kills all of the containers and deletes all data volumes so you can quickly start over from scratch with fresh databases/services. When the database is empty on first start, the backend automatically migrates and re-seeds it. And if a new build contains new migrations, the backend applies them at startup. See the root package.json for more details on these new commands.
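For reference, the new scripts in the root package.json are roughly shaped like this (docker:up is the alias quoted above; the bodies of docker:build and docker:reset here are my best guess at the equivalent Compose commands, so check the actual file):

```json
{
  "scripts": {
    "docker:up": "docker compose up -d && docker compose logs -f dashboard backend",
    "docker:build": "docker compose build",
    "docker:reset": "docker compose down --volumes --remove-orphans"
  }
}
```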
Most of the Dockerfile and docker-compose stuff is fairly self-explanatory, and I tried to include plenty of comments for anyone digging into those bits. Other than that, I just had to make a few small tweaks to the code to support building/deploying in Docker rather than on Vercel. The biggest one is support for runtime environment variables on the client side.
NextJS runtime env
As you probably know, NextJS compiles all NEXT_PUBLIC_ env vars into static values when running next build. This isn't an issue on Vercel, but it is definitely an issue in a Docker build, because it means you'd have to hard code runtime values like URLs or analytics API keys into your Docker images, and therefore create a new build for every URL change even though the code hasn't changed. Fortunately, a package called next-runtime-env fixes this problem (see its README for a detailed explanation). The only change required to adopt it is to replace every process.env.NEXT_PUBLIC_ value in the code with the package's env('NEXT_PUBLIC_WHATEVER') helper, which ensures values are read at runtime rather than hard coded into the build output. So I replaced every process.env.NEXT_PUBLIC_X variable with env('NEXT_PUBLIC_X') in the dashboard and backend (mostly the dashboard).
Related to all of that, I also realized that the @stackframe/stack package was using several process.env.NEXT_PUBLIC_X values as well, so it needs the same treatment. However, I didn't want to force next-runtime-env to be a dependency of that package, and I didn't want to create conflicts for customers that may already be using next-runtime-env: that package writes the NEXT_PUBLIC_ values to window.__ENV in the browser, and two instances could step on each other's toes when writing values to the window object. To rule that out, I created a simple helper component that does the exact same thing as next-runtime-env except that it writes internal Stack vars to window.__STACK_ENV__. You can find it in packages/stack/src/lib/env. It solves the same problem in a couple dozen lines of code with no third-party dependency and removes any possibility of conflicts for people already using next-runtime-env.
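As a rough sketch of the idea (the function name and shape here are illustrative, not the actual code; see packages/stack/src/lib/env for the real implementation):

```typescript
// In the browser, a small inline <script> is assumed to have written the
// NEXT_PUBLIC_ values to window.__STACK_ENV__ at request time; on the server
// we read the live process environment instead.
type EnvMap = Record<string, string | undefined>;

function stackEnv(name: string): string | undefined {
  const injected = (globalThis as { __STACK_ENV__?: EnvMap }).__STACK_ENV__;
  if (injected && name in injected) {
    return injected[name]; // browser: runtime-injected value
  }
  return process.env[name]; // server: live environment
}
```

Because the values are looked up at call time rather than inlined by next build, the same build output works across environments.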
New NEXT_PUBLIC_INSECURE_COOKIE config
Since Docker builds result in a Node app running with NODE_ENV=production, cookies were expected to come from an https:// URL. As a result, running the builds with docker-compose on localhost didn't work and caused an infinite redirect loop in the browser. To work around that limitation, I added a new env var called NEXT_PUBLIC_INSECURE_COOKIE that allows you to use the dashboard on localhost when NODE_ENV=production. I added detailed comments about this in the code and noted that it should NEVER be used in production. See packages/stack/src/lib/cookie.ts for details.
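The decision boils down to something like this (the function and parameter names are mine for illustration, not the actual code in packages/stack/src/lib/cookie.ts):

```typescript
// Sketch: decide whether cookies should carry the Secure attribute.
function shouldUseSecureCookies(
  nodeEnv: string | undefined,
  insecureCookie: string | undefined,
): boolean {
  // Outside production, plain-HTTP cookies are fine anyway.
  if (nodeEnv !== "production") return false;
  // In production, mark cookies Secure unless the escape hatch is set
  // (intended ONLY for testing a production build on http://localhost).
  return insecureCookie !== "true";
}
```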
Self host seed script
Instead of messing with the existing seed script that is used for local dev, I created a new one that is intended to run inside the backend Docker container on first startup. It's pretty self-explanatory if you read through it. The main difference from the original seed script is that it accepts a few new env vars that can configure a default admin user and optionally disable sign-ups for the internal project. Both are optional and skipped by default, so the default behavior is that you sign up to create your initial account, just like with the original seed script. The downside is that you then won't have admin access to the internal Stack Dashboard project once in the dashboard: you'll be a member of it, but you won't be able to manage it or add other users. The new seed script's admin user does get access to that project, so that's probably what most people will prefer when self-hosting.
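For illustration, the knobs look something like this (the variable names below are made up; the real ones are in apps/backend/.env.docker):

```sh
# Optional: create a default admin user with access to the internal project.
# If unset, admin creation is skipped, matching the original seed behavior.
STACK_SEED_ADMIN_EMAIL=admin@example.com
STACK_SEED_ADMIN_PASSWORD=change-me

# Optional: disable sign-ups for the internal dashboard project.
STACK_SEED_DISABLE_SIGNUPS=true
```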
You can find these new env options in the new .env file at apps/backend/.env.docker. Note that the new docker-compose.yml loads that file automatically. I also added an apps/dashboard/.env.docker file so that Docker deployments and local dev have their own distinct configs. This is important for a couple of reasons. First, the old docker-compose config used for local dev is still in place, so pnpm dev functions the same way it already did; the new docker-compose config therefore has its own databases, etc., and the URLs and db ports are different to avoid any conflicts or confusion. Second, you can't set URL env vars to http://localhost:PORT inside a Docker container, because in that context localhost is no longer your machine - it's the container itself. You need a URL that resolves to the right place from within the container but also resolves to localhost outside the container. Fortunately, Docker has a solution for this: host.docker.internal. In short, any URLs that contain localhost in local dev needed to be converted to host.docker.internal when running in a container. See Docker's workaround docs for that here. That is enabled in the docker-compose config by these lines on the backend and dashboard...
```yaml
backend:
  # ...
  extra_hosts:
    - "host.docker.internal:host-gateway"
```
Docker should already have taken care of this hostname resolution for you when it was installed, but just in case it didn't, you can add the following to your /etc/hosts file:

```
127.0.0.1 host.docker.internal
```
CI Docker Builds
Lastly, I added a GitHub workflow to build and publish the new Docker builds. All that's required to get them working is to provide 3 secrets in your repo or org settings:
- DOCKER_REPO
- DOCKER_USER
- DOCKER_PASSWORD
So, for example, if you create a stack-auth org on Docker Hub, DOCKER_REPO would just be your org name of stack-auth and the user/password values could be for any user that has push access to your account.
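For context, those secrets feed into a standard registry login step in the workflow, roughly like this (docker/login-action is the real action; the surrounding job layout is abbreviated here):

```yaml
- name: Log in to Docker Hub
  uses: docker/login-action@v3
  with:
    username: ${{ secrets.DOCKER_USER }}
    password: ${{ secrets.DOCKER_PASSWORD }}
```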
As for the build tags that are created by this workflow, there are 3 different potential tag formats. Any time you merge to dev or main, the workflow will build and push two tags - one is the short SHA from the commit (first 7 chars) and the other is the branch name. So a commit to dev would look like this:
```sh
# assuming you go with the `stack-auth` org on Docker Hub
stack-auth/stack-dashboard:abc1234
stack-auth/stack-dashboard:dev
stack-auth/stack-backend:abc1234
stack-auth/stack-backend:dev
```
In this case, the :dev branch tag will always be an alias for the latest build on the dev branch while the :abc1234 tag is the specific commit hash. This allows users to just pull the "latest" of dev or pin to a specific commit. The main branch builds work the same way.
I also configured it to build when a commit is tagged with a version number. I know you don't currently use tags on your releases, but I think it'd be really helpful if you did, both so it's clear when package versions have actually changed and because a version tag triggers a special tag format in docker/metadata-action that automates the Docker tagging. When you tag a commit with a version number in the format 1.2.3, the metadata action treats it as an official release, and the resulting Docker builds are pushed in the format:
```sh
stack-auth/stack-dashboard:1.2.3
stack-auth/stack-dashboard:latest
stack-auth/stack-backend:1.2.3
stack-auth/stack-backend:latest
```
That allows users to pick a specific production release or just always pull the "latest" stable prod release.
```sh
docker pull stack-auth/stack-dashboard:latest
docker pull stack-auth/stack-backend:latest
```
Misc
I also updated Prisma to the latest 5.20.0 release. I know that probably seems unrelated, but the previous, fairly old version didn't ship a Prisma binary for the linux/arm64 architecture. That mattered because anyone building on a new M-series MacBook would get an error when running pnpm install inside the Debian container these Docker images are based on. Updating to a more recent Prisma version solved this.
Ok, I think that's everything. I've been using all of this in our own Kubernetes deployments for the last couple of weeks while iterating on things, and everything has been really stable for me, so I think I've smoothed out all the rough edges. Let me know if you have any questions or if there's anything else I can do!
@jshimko is attempting to deploy a commit to the Stack Team on Vercel.
A member of the Team first needs to authorize it.
This is excellent, thank you so much for your contribution!
From a first glance, most of this seems great. My main concern is about the handling of the environment variables; this PR (and next-runtime-env) uses noStore, which disables static site generation and partial pre-rendering, both for ourselves and for our customers who use the @stackframe/stack package.
I would actually argue that requiring a Docker image rebuild when updating envvars is by design; this way, the statically generated files (and with those the initial Next.js response) can contain as much information as possible. This means we can't easily publish a Docker image to a registry, though. IMO this is an acceptable tradeoff; if you're self-hosting, you're setting yourself up to struggle with much harder things than just rebuilding a Docker image from scratch. (DB migrations, for example.)
What do you think?
Thanks for the review @N2D4! So, a couple thoughts...
First, noStore only disables static rendering for the component that it is used in. A few lines from their docs...
> `unstable_noStore` can be used to declaratively opt out of static rendering and indicate a particular component should not be cached.

> `unstable_noStore` is preferred over `export const dynamic = 'force-dynamic'` as it is more granular and can be used on a per-component basis.
In this particular case, we're only affecting a single NextJS <Script/> component. Other components outside of it should render the same as before.
As for hard coding environment variables into Docker builds, that is very much against best practices for a variety of reasons, particularly because it breaks the principle that Docker builds should be stateless (long considered a best practice for software in general by the classic "12 Factor App" methodology). Docker even mentions this in its build best practices under the section "Create ephemeral containers", which links directly to the 12 Factor site...
Refer to Processes under The Twelve-factor App methodology to get a feel for the motivations of running containers in such a stateless fashion.
See also Factor 3 - Config: https://12factor.net/config
Requiring every user to build their own images just to set a dynamic URL also means nobody can even use the same build between dev, staging, production, etc. And the only reason for that is because the URL changes between those envs. Supporting runtime config entirely solves that. If I deploy a build for my staging environment and thoroughly test everything, I need to be able to use that same build again when I promote it to production. Otherwise a new Docker build doesn't guarantee everything is 100% the same. At the very least, it's unnecessary overhead to create a duplicate build that only changes a few environment variables.
Perhaps more importantly, hard coding env into Docker builds means that nobody can ever publish reusable Docker builds. That one is kind of a non-starter for us. For example, all of our Kubernetes deployments (dev/staging/production) are completely automated and the release process goes from 1) development (latest of main branch) to 2) staging (a commit tagged for release) to 3) production (same tagged release promoted from staging) and the same Docker builds are used from end to end. Having to build and publish a new image for every deployment adds a lot of opportunities for build inconsistencies, needless CI/CD complexity, and tons of Docker image storage that would all otherwise be avoided by supporting runtime configs in a single reusable build.
Lastly, having Docker builds be reusable means that you (Stack Auth) can publish "official" builds (using the GitHub workflow I added in this PR). Since most self-host users aren't modifying the code, most won't even need to bother with the build step. There's a big advantage to having a single official source of production builds that everyone uses: it completely removes the "works with my build" debugging headache that would inevitably turn into a time-consuming community support nightmare. With official builds, the only thing that differs between users is the config they pass in, which greatly reduces the number of things that could be preventing a deployment from working correctly. If it works for one properly configured deployment, it should work for every properly configured deployment, because you can be sure it is 100% the same code and build output. That literally allows you to point to working configs in the docs and just say "works on my machine"! :)
Unrelated, just wanted to clarify on this comment about migrations...
IMO this is an acceptable tradeoff; if you're self-hosting, you're setting yourself up to struggle with much harder things than just rebuilding a Docker image from scratch. (DB migrations, for example.)
Migrations are actually automated in the Docker builds. They run on backend container startup in the entrypoint script (which can be disabled with the STACK_SKIP_MIGRATIONS env var if needed). So any time a new migration is released, it will automatically apply when the new backend build is deployed. I'd argue this is even easier than deploying on Vercel, where you need to manually apply migrations to your database and try to time that with the code release that depends on them. Not running them on startup also assumes all self-host users will even be aware that a new Stack release contains new migrations. That's why automation is key here. As you know, deploying code that expects migrations to have been run can very easily lead to production downtime when Prisma falls over due to schema mismatches. Automating migrations ensures the app can't even start up until the new migrations have been successfully applied. Also, manually doing stuff to a production database is no fun!
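The gating logic amounts to something like this (a simplified sketch, not the actual entrypoint script):

```shell
#!/bin/sh
# Simplified sketch of the migration gate in the backend entrypoint.
# The real script would run something like `prisma migrate deploy` here
# and abort startup if it fails, instead of just echoing.
run_migrations_if_enabled() {
  if [ "${STACK_SKIP_MIGRATIONS:-}" = "true" ]; then
    echo "migrations skipped"
  else
    echo "migrations applied"
  fi
}

run_migrations_if_enabled
echo "starting server"
```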
So, not sure if I made a convincing case here, but happy to answer any questions or clarify anything further if you're still concerned about these changes. Let me know what you think!
> In this particular case, we're only affecting a single NextJS `<Script/>` component. Other components outside of it should render the same as before.
What this means in practice is that this <Script /> component will suspend, which will pause the rendering of all components up to the closest Suspense boundary. Since <StackProvider /> is at the very top of the layout.tsx, this would be the entire page. We could wrap <StackProvider /> in a Suspense boundary, but in that case the <Script /> would not be the first script that runs on the page, and window.__STACK_ENV__ will not be available when the statically rendered components are hydrated in the browser.
> Migrations are actually automated in the Docker builds. They run on backend container startup in the entrypoint script (which can be disabled with the `STACK_SKIP_MIGRATIONS` env var if needed). So any time a new migration is released, it will automatically apply when the new backend build is deployed. [...]
That only works if you don't worry about downtime during migrations. Think of the following scenario when renaming a column from A to B:
- DB v1, server v1: The initial version where the col is named A.
- DB v2, server v1: We add a new col named B.
- DB v2, server v2: We update the server to write to both A and B, but it still reads from A.
- DB v3, server v2: We do a database migration where we copy A to B.
- DB v3, server v3: We update the server to write and read from B only.
- DB v4, server v3: We delete A.
You need to coordinate the server and DB updates when migrating; if you instead apply all the updates at the same time (particularly if you're running multiple revisions of the server simultaneously as part of your autoscaling/rollout), things will break. This is acceptable if you're fine with a few minutes of downtime, but not otherwise.
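Concretely, the "server v2" step in that sequence dual-writes while still reading the old column, so it works against both DB v2 (B just added) and DB v3 (B backfilled). A sketch with illustrative types (not real Stack code):

```typescript
// Column A is being renamed to B; server v2 writes both but reads only A.
interface UserRow { A: string; B?: string }

function writeName(row: UserRow, value: string): UserRow {
  return { ...row, A: value, B: value }; // dual write to old and new columns
}

function readName(row: UserRow): string {
  return row.A; // keep reading the old column until server v3
}
```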
I get your point regarding configless container builds. Cal.com has a similar setup and they publish a Docker image that's designed for local usage, alongside a build script to customize it for production; we could have something like that. The other alternative is that we disable static rendering/PPR on the Docker version, and essentially do what you did here, while keeping it in the main deployment. I'll think about this a bit today.
I talked to the Next.js team and they have an --experimental-build-mode CLI flag that we should be able to use to not inline variables. This would do what we want, though I'm not sure if the behavior has already been released with Next.js 15 or not — at least it's a new avenue though.
@jshimko @N2D4 Great work happening here. Would love to try it out soon.
@jshimko Can you give us the permission to edit this PR?
Continued in #353