Gracefully handle invalidated Postgres replication slots
What version of Materialize are you using?
v0.26.1-dev (0a77973af)
How did you install Materialize?
Built from source
What is the issue?
While trying to restart a Postgres source after several days without using it, I saw the following messages in the logs:
storage-u10: 2022-09-17T19:49:08.224182Z INFO mz_storage::source::postgres: resuming replication for source u10
storage-u10: 2022-09-17T19:49:09.191685Z WARN mz_storage::source::postgres: replication for source u10 interrupted, retrying: db error: ERROR: cannot read from logical replication slot "materialize_3b766e55e96e4c5e8e2e9cbe14c80068"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
This caused storaged to get stuck in an infinite retry loop.
This Postgres database was configured with max_slot_wal_keep_size to ensure that orphaned slots couldn't fill up the disk.
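For context, max_slot_wal_keep_size (PostgreSQL 13+) invalidates a slot once it would retain more WAL than the configured limit, and an invalidated slot can never be resumed; the upstream database reports this in pg_replication_slots, where wal_status becomes lost. A minimal standalone check along these lines (not code from this repo; the connection string and slot name are placeholders) confirms the state on the Postgres side:

```rust
use tokio_postgres::NoTls;

/// Standalone check of a replication slot's health on the upstream Postgres.
#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres dbname=postgres", NoTls).await?;
    // The connection object drives the socket and must be polled separately.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // On PostgreSQL 13+, wal_status becomes 'lost' once the slot has been
    // invalidated because it exceeded max_slot_wal_keep_size.
    let rows = client
        .query(
            "SELECT slot_name, wal_status FROM pg_replication_slots WHERE slot_name = $1",
            &[&"materialize_3b766e55e96e4c5e8e2e9cbe14c80068"],
        )
        .await?;
    for row in rows {
        let slot: &str = row.get("slot_name");
        let status: Option<&str> = row.get("wal_status");
        println!("slot {slot} has wal_status {status:?}");
    }
    Ok(())
}
```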
Relevant log output
storage-u10: 2022-09-17T19:49:08.224182Z INFO mz_storage::source::postgres: resuming replication for source u10
storage-u10: 2022-09-17T19:49:09.191685Z WARN mz_storage::source::postgres: replication for source u10 interrupted, retrying: db error: ERROR: cannot read from logical replication slot "materialize_3b766e55e96e4c5e8e2e9cbe14c80068"
DETAIL: This slot has been invalidated because it exceeded the maximum reserved size.
What would you consider graceful in this situation, @andrioni? Marking the source as permanently errored in the (to be implemented) source errors collection?
Marking it permanently as errored (and potentially exposing this fact on queries as an additional NOTICE message), or potentially re-extracting the source from scratch (with a new snapshot and replication slot).
The former is better for cases where users would prefer to have more control and/or would prefer to avoid random re-snapshots (which can add significant load on the DB); the latter is better for that "just works" experience.
Right now, the data we had before the snapshot also stays queryable: we should also decide if that's the behavior we want here, or if we'd prefer to fail queries to make the issue more noticeable.
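On the re-extraction option: Postgres keeps an invalidated slot around (with wal_status = lost) until it is dropped explicitly, so re-extracting from scratch would also involve cleaning up the dead slot upstream before creating a new one for the fresh snapshot. A rough sketch of that cleanup step, assuming a tokio-postgres client (the helper is hypothetical, not code from the repo):

```rust
use tokio_postgres::Client;

/// Hypothetical upstream cleanup for the "re-extract from scratch" option:
/// drop the invalidated slot so a fresh one can be created for the new
/// snapshot. The slot name comes from the caller.
async fn drop_invalidated_slot(
    client: &Client,
    slot_name: &str,
) -> Result<(), tokio_postgres::Error> {
    // pg_drop_replication_slot errors if the slot is still active, so the
    // failed replication connection must be torn down first.
    client
        .execute("SELECT pg_drop_replication_slot($1)", &[&slot_name])
        .await?;
    Ok(())
}
```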
Marking it permanently as errored (and potentially exposing this fact on queries as an additional NOTICE message), or potentially re-extracting the source from scratch (with a new snapshot and replication slot).
The former is better for cases where users would prefer to have more control and/or would prefer to avoid random re-snapshots (which can add significant load on the DB); the latter is better for that "just works" experience.
Makes sense. I've got a preference for the former for sure. I think we should make it easy to manually trigger a re-snapshot if you do lose a replication slot, but re-snapshotting without manual action seems pretty scary.
Right now, the data we had before the snapshot also stays queryable: we should also decide if that's the behavior we want here, or if we'd prefer to fail queries to make the issue more noticeable.
I think it definitely is! We need to make it really easy to monitor the errors collection, and to surface those errors loudly in the UI and such, but I don't think we want to potentially take down applications by failing queries every time there's a minor network interruption between Materialize and an upstream source.
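One way to express the behavior discussed above is to classify the slot-invalidation error as definite (unrecoverable) rather than transient, so the retry loop stops and the failure can be surfaced in the errors collection instead. A rough sketch, assuming tokio-postgres errors (the enum and helper are hypothetical, not the shape of the actual fix below):

```rust
use tokio_postgres::Error as PgError;

/// Hypothetical split between errors worth retrying and errors that can
/// never succeed, such as reading from an invalidated replication slot.
enum ReplicationError {
    /// Temporary problem (network blip, restart); keep retrying.
    Transient(PgError),
    /// The slot is gone for good; retrying cannot help, so the source
    /// should be marked as errored instead.
    Definite(PgError),
}

fn classify(err: PgError) -> ReplicationError {
    // Postgres reports slot invalidation with the DETAIL line seen in the
    // logs above: "This slot has been invalidated because it exceeded the
    // maximum reserved size."
    let invalidated = err.as_db_error().map_or(false, |db| {
        db.detail()
            .map_or(false, |d| d.contains("This slot has been invalidated"))
    });
    if invalidated {
        ReplicationError::Definite(err)
    } else {
        ReplicationError::Transient(err)
    }
}
```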
Fixed by https://github.com/MaterializeInc/materialize/pull/23467.