Electric fails to detect when its persistent storage goes away and prevents Postgres from cleaning up WAL files
Imagine a simple scenario where Electric is running inside a Kubernetes pod with no persistent storage. Some shapes get created, so Electric creates a publication in Postgres and starts processing transactions.
When the pod is restarted, a new file system is created for it with no traces of the previous shape storage. Electric will no longer be able to process incoming transactions from Postgres, causing the latter to build up its WAL backlog indefinitely.
Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.
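To make the proposed behaviour concrete, here is a minimal sketch in Python (not Electric's actual code). `SlotState` and `next_confirmed_lsn` are hypothetical names, and LSNs are simplified to plain integers. The assumption is that with no active shapes nothing needs the received transactions, so they can be acknowledged straight away and Postgres can recycle the WAL.

```python
from dataclasses import dataclass


@dataclass
class SlotState:
    """Hypothetical view of the consumer's state; not Electric's data model."""
    active_shapes: int   # shapes currently known to the shape collector
    confirmed_lsn: int   # last LSN acknowledged to Postgres
    received_lsn: int    # last LSN received from the replication stream


def next_confirmed_lsn(state: SlotState) -> int:
    """Return the LSN that should be acknowledged to Postgres."""
    if state.active_shapes == 0:
        # Nothing consumes these transactions, so drop them by acknowledging
        # everything received. Never move the confirmed position backwards.
        return max(state.confirmed_lsn, state.received_lsn)
    # With active shapes, acknowledgement is left to the normal processing
    # path, so the confirmed position is unchanged here.
    return state.confirmed_lsn
```

For example, `next_confirmed_lsn(SlotState(active_shapes=0, confirmed_lsn=100, received_lsn=250))` returns `250`, while the same call with `active_shapes=2` leaves the position at `100`.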
N.b.: https://discord.com/channels/933657521581858818/1285476835412541516
I think this is also related to https://github.com/electric-sql/electric/issues/1774 - if Electric boots up and has no shapes, it should update the replication slot accordingly. I feel that there should be a mechanism/service that keeps the publication properly configured, and perhaps that should also inform how to handle "deprecated" transactions
For multi-tenancy, losing storage would mean we don't know where the databases are, so we couldn't clean up the publications. But multi-tenancy is an advanced use case so perhaps we can get away with not dealing with it.
> Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.
If I read correctly, when a fresh Electric finds a pre-existing replication slot, it sees that the slot is inconsistent with its local metadata (which is empty) and doesn't use it. This is a conservative approach in case the owner of the replication slot connects later.
> if Electric boots up and has no shapes, it should update the replication slot accordingly.
@msfstef, meaning just drop the replication slot and recreate it, right? Shall we make sure this is intentional, e.g. via a --force-recreate-replication-slot flag, or be more optimistic, since it's important to make sure that we clean up the WAL and shapes are cheap anyway? (cc @KyleAMathews @robacourt)
Is there any way we could tag the replication slot with the Electric deployment? That way, if it sees a replication slot but no active shapes, it will know that it owns that replication slot and can clean up appropriately. Some instance ID environment variable?
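One way this tagging could look, sketched in Python under stated assumptions: the `electric_slot` prefix and the helper names `slot_name_for` and `owns_slot` are hypothetical, and the instance ID is passed in as an argument rather than read from any particular environment variable. The only Postgres facts relied on are that replication slot names may contain only lowercase letters, digits and underscores, and that identifiers are limited to 63 bytes by default.

```python
import re

MAX_SLOT_NAME_LEN = 63  # default Postgres identifier limit (NAMEDATALEN - 1)


def slot_name_for(instance_id: str, prefix: str = "electric_slot") -> str:
    """Derive a slot name that encodes which deployment owns the slot."""
    # Slot names allow only lowercase letters, digits and underscores,
    # so every other character is replaced with an underscore.
    cleaned = re.sub(r"[^a-z0-9_]", "_", instance_id.lower())
    return f"{prefix}_{cleaned}"[:MAX_SLOT_NAME_LEN]


def owns_slot(slot_name: str, instance_id: str, prefix: str = "electric_slot") -> bool:
    """True if an existing slot was named for this deployment."""
    return slot_name == slot_name_for(instance_id, prefix)
```

Note that two long instance IDs sharing the same first characters would collide after truncation, so a real implementation would probably want to hash the ID instead.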
The replication slot does have a name now, so I think we're halfway there. We also acquire an advisory lock before we start consuming from the replication slot, which protects against instances competing with each other.
We use a default name, but maybe that is good enough?
Currently, we have a name for the slot, and after my changes, we have a thing that force-resets the publication on startup to match loaded shapes. I think this issue can be closed
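For readers following along, a rough Python sketch of what resetting the publication to match loaded shapes amounts to. This is not the actual change: `publication_reset_sql` is a hypothetical helper, and table names are assumed to be already schema-qualified and safely quoted. It only builds the `ALTER PUBLICATION ... ADD TABLE` / `DROP TABLE` statements from the difference between the two sets of tables.

```python
from typing import List, Set


def publication_reset_sql(publication: str, published: Set[str], wanted: Set[str]) -> List[str]:
    """Build statements that make the publication cover exactly `wanted`.

    `published` is the set of tables currently in the publication and
    `wanted` is the set of tables referenced by the loaded shapes.
    """
    to_add = sorted(wanted - published)
    to_drop = sorted(published - wanted)
    statements = []
    if to_add:
        statements.append(
            f"ALTER PUBLICATION {publication} ADD TABLE {', '.join(to_add)}"
        )
    if to_drop:
        statements.append(
            f"ALTER PUBLICATION {publication} DROP TABLE {', '.join(to_drop)}"
        )
    return statements
```

In the scenario from this issue, storage is lost and no shapes are loaded, so `wanted` is empty and the result is a single `DROP TABLE` statement covering everything that was previously published.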
Yeah, a lot has changed since this was originally reported.