Gracefully handle invalidated Postgres replication slots
What version of Materialize are you using?
v0.26.1-dev (0a77973af)
How did you install Materialize?
Built from source
What is the issue?
While trying to restart a Postgres source after several days without using it, I saw the following messages in the logs:
storage-u10: 2022-09-17T19:49:08.224182Z INFO mz_storage::source::postgres: resuming replication for source u10
storage-u10: 2022-09-17T19:49:09.191685Z WARN mz_storage::source::postgres: replication for source u10 interrupted, retrying: db error: ERROR: cannot read from logical replication slot "materialize_3b766e55e96e4c5e8e2e9cbe14c80068"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
This caused storaged to get stuck in an infinite retry loop.
This Postgres database was configured with max_slot_wal_keep_size to ensure that orphaned slots couldn't fill up the disk.
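For context, max_slot_wal_keep_size (PostgreSQL 13+) invalidates a slot once it would retain more WAL than the configured limit, and an invalidated slot can never be resumed; the upstream database reports this in pg_replication_slots, where wal_status becomes lost. A minimal standalone check along these lines (not code from this repo; the connection string and slot name are placeholders) confirms the state on the Postgres side:

```rust
use tokio_postgres::NoTls;

/// Standalone check of a replication slot's health on the upstream Postgres.
#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres dbname=postgres", NoTls).await?;
    // The connection object drives the socket and must be polled separately.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // On PostgreSQL 13+, wal_status becomes 'lost' once the slot has been
    // invalidated because it exceeded max_slot_wal_keep_size.
    let rows = client
        .query(
            "SELECT slot_name, wal_status FROM pg_replication_slots WHERE slot_name = $1",
            &[&"materialize_3b766e55e96e4c5e8e2e9cbe14c80068"],
        )
        .await?;
    for row in rows {
        let slot: &str = row.get("slot_name");
        let status: Option<&str> = row.get("wal_status");
        println!("slot {slot} has wal_status {status:?}");
    }
    Ok(())
}
```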
Relevant log output
storage-u10: 2022-09-17T19:49:08.224182Z INFO mz_storage::source::postgres: resuming replication for source u10
storage-u10: 2022-09-17T19:49:09.191685Z WARN mz_storage::source::postgres: replication for source u10 interrupted, retrying: db error: ERROR: cannot read from logical replication slot "materialize_3b766e55e96e4c5e8e2e9cbe14c80068"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
What would you consider graceful in this situation, @andrioni? Marking the source as permanently errored in the (to be implemented) source errors collection?
Marking it permanently as errored (and potentially exposing this fact on queries as an additional NOTICE message), or potentially re-extracting the source from scratch (with a new snapshot and replication slot).
The former is better for cases where users would prefer to have more control and/or would prefer to avoid random re-snapshots (which can add significant load on the DB); the latter is better for that "just works" experience.
Right now, the data we had before the snapshot also stays queryable: we should also decide if that's the behavior we want here, or if we'd prefer to fail queries to make the issue more noticeable.
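On the re-extraction option: Postgres keeps an invalidated slot around (with wal_status = lost) until it is dropped explicitly, so re-extracting from scratch would also involve cleaning up the dead slot upstream before creating a new one for the fresh snapshot. A rough sketch of that cleanup step, assuming a tokio-postgres client (the helper is hypothetical, not code from the repo):

```rust
use tokio_postgres::Client;

/// Hypothetical upstream cleanup for the "re-extract from scratch" option:
/// drop the invalidated slot so a fresh one can be created for the new
/// snapshot. The slot name comes from the caller.
async fn drop_invalidated_slot(
    client: &Client,
    slot_name: &str,
) -> Result<(), tokio_postgres::Error> {
    // pg_drop_replication_slot errors if the slot is still active, so the
    // failed replication connection must be torn down first.
    client
        .execute("SELECT pg_drop_replication_slot($1)", &[&slot_name])
        .await?;
    Ok(())
}
```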
Marking it permanently as errored (and potentially exposing this fact on queries as an additional NOTICE message), or potentially re-extracting the source from scratch (with a new snapshot and replication slot).
The former is better for cases where users would prefer to have more control and/or would prefer to avoid random re-snapshots (which can add significant load on the DB); the latter is better for that "just works" experience.
Makes sense. I've got a preference for the former for sure. I think we should make it easy to manually trigger a re-snapshot if you do lose a replication slot, but re-snapshotting without manual action seems pretty scary.
Right now, the data we had before the snapshot also stays queryable: we should also decide if that's the behavior we want here, or if we'd prefer to fail queries to make the issue more noticeable.
I think it definitely is! We need to make it really easy to monitor the errors collection, and to surface those errors loudly in the UI and such, but I don't think we want to potentially take down applications by failing queries every time there's a minor network interruption between Materialize and an upstream source.
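One way to express the behavior discussed above is to classify the slot-invalidation error as definite (unrecoverable) rather than transient, so the retry loop stops and the failure can be surfaced in the errors collection instead. A rough sketch, assuming tokio-postgres errors (the enum and helper are hypothetical, not the shape of the actual fix below):

```rust
use tokio_postgres::Error as PgError;

/// Hypothetical split between errors worth retrying and errors that can
/// never succeed, such as reading from an invalidated replication slot.
enum ReplicationError {
    /// Temporary problem (network blip, restart); keep retrying.
    Transient(PgError),
    /// The slot is gone for good; retrying cannot help, so the source
    /// should be marked as errored instead.
    Definite(PgError),
}

fn classify(err: PgError) -> ReplicationError {
    // Postgres reports slot invalidation with the DETAIL line seen in the
    // logs above: "This slot has been invalidated because it exceeded the
    // maximum reserved size."
    let invalidated = err.as_db_error().map_or(false, |db| {
        db.detail()
            .map_or(false, |d| d.contains("This slot has been invalidated"))
    });
    if invalidated {
        ReplicationError::Definite(err)
    } else {
        ReplicationError::Transient(err)
    }
}
```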
Fixed by https://github.com/MaterializeInc/materialize/pull/23467.