Electric fails to detect when its persistent storage goes away and prevents Postgres from cleaning up WAL files

Open alco opened this issue 1 year ago • 6 comments

Imagine a simple scenario where Electric is running inside a Kubernetes pod with no persistent storage. Some shapes get created, so Electric creates a publication in Postgres and starts processing transactions.

When the pod is restarted, a new file system is created for it with no traces of the previous shape storage. Electric will no longer be able to process incoming transactions from Postgres, causing the latter to build up its WAL backlog indefinitely.
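One way to watch for this backlog is to compare each slot's `restart_lsn` with the server's current WAL position. The sketch below is illustrative and not Electric's code: the SQL relies on the standard `pg_replication_slots` view and `pg_wal_lsn_diff` function, while the Python helper names (`lsn_to_int`, `retained_wal_bytes`) are hypothetical and only reproduce the same arithmetic on textual LSNs.

```python
# Query an operator could run to see how much WAL each slot is holding back.
RETAINED_WAL_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL that Postgres must keep for a slot stuck at restart_lsn."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
```

A slot whose `retained_bytes` keeps growing while `active` is false would match the scenario described here.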

Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.

alco avatar Sep 17 '24 07:09 alco

N.b.: https://discord.com/channels/933657521581858818/1285476835412541516

thruflo avatar Sep 17 '24 08:09 thruflo

I think this is also related to https://github.com/electric-sql/electric/issues/1774 - if Electric boots up and has no shapes, it should update the replication slot accordingly. I feel that there should be a mechanism/service that keeps the publication properly configured, and perhaps that should also inform how to handle "deprecated" transactions.

msfstef avatar Oct 08 '24 14:10 msfstef

For multi-tenancy, losing storage would mean we don't know where the databases are, so we couldn't clean up the publications. But multi-tenancy is an advanced use case so perhaps we can get away with not dealing with it.

robacourt avatar Nov 05 '24 17:11 robacourt

Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.

If I read correctly, when a fresh Electric finds a pre-existing replication slot, it sees that the slot is inconsistent with its local metadata (which is empty) and doesn't use it. This is a conservative approach in case the owner of the replication slot connects later.

if Electric boots up and has no shapes, it should update the replication slot accordingly.

@msfstef, meaning just drop the replication slot and recreate it, right? Should we require this to be intentional, e.g. via a --force-recreate-replication-slot flag, or be more optimistic, since it's important to make sure that we clean up the WAL and shapes are cheap anyway? (cc @KyleAMathews @robacourt)

balegas avatar Nov 05 '24 17:11 balegas

Is there any way we could tag the replication slot with the Electric deployment? That way, if it sees a replication slot but no active shapes, it will know that it owns that replication slot and can clean up appropriately. Some instance ID environment variable?
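A minimal sketch of that idea, assuming a hypothetical `ELECTRIC_INSTANCE_ID` environment variable and a hypothetical `electric_slot_` prefix (neither is taken from Electric's actual configuration). It only relies on the Postgres rule that slot names consist of lowercase letters, digits and underscores, and on the 63-byte identifier limit.

```python
import os
import re

SLOT_PREFIX = "electric_slot_"  # hypothetical prefix, for illustration only
MAX_SLOT_NAME_LEN = 63  # Postgres truncates identifiers beyond 63 bytes


def slot_name_for(instance_id: str) -> str:
    """Build a slot name that embeds the deployment's instance id."""
    # Replace anything Postgres would reject in a slot name with '_'.
    safe_id = re.sub(r"[^a-z0-9_]", "_", instance_id.lower())
    return (SLOT_PREFIX + safe_id)[:MAX_SLOT_NAME_LEN]


def owns_slot(slot_name: str, instance_id: str) -> bool:
    """True if an existing slot was created by this deployment."""
    return slot_name == slot_name_for(instance_id)


def current_instance_id() -> str:
    # ELECTRIC_INSTANCE_ID is an assumed variable name, not a documented one.
    return os.environ.get("ELECTRIC_INSTANCE_ID", "default")
```

With this scheme, an instance that boots with empty storage and finds a slot for which `owns_slot(...)` is true could treat the slot as its own stale state rather than as someone else's.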

magnetised avatar Apr 14 '25 14:04 magnetised

The replication slot does have a name now, so I think we're halfway there. We also acquire an advisory lock before we start consuming the replication slot, which protects against instances competing with each other.
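For illustration, the locking step might look roughly like the sketch below. It is not Electric's implementation: it uses the non-blocking `pg_try_advisory_lock` for simplicity, the key derivation via CRC32 is an assumption, and `cursor` stands for any DB-API cursor (for example one from psycopg) on a connection that stays open while the slot is consumed.

```python
import zlib


def lock_key_for(slot_name: str) -> int:
    """Derive a stable integer advisory-lock key from the slot name."""
    # CRC32 yields an unsigned 32-bit value, which fits in Postgres' bigint.
    return zlib.crc32(slot_name.encode("utf-8"))


def try_acquire_slot_lock(cursor, slot_name: str) -> bool:
    """Return True if this session now holds the advisory lock for slot_name."""
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (lock_key_for(slot_name),))
    (acquired,) = cursor.fetchone()
    return bool(acquired)
```

Session-level advisory locks are released when the connection closes, so a crashed instance would not hold the slot hostage.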

We use a default name, but maybe that is good enough?

balegas avatar Apr 14 '25 21:04 balegas

Currently, we have a name for the slot, and after my changes, we have a step that force-resets the publication on startup to match the loaded shapes. I think this issue can be closed.
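As a rough sketch of what such a reset could compute (this is not the actual change, and the function names are hypothetical): given the tables currently in the publication, which can be read from the standard `pg_publication_tables` view, and the tables referenced by the loaded shapes, emit the `ALTER PUBLICATION` statements that bring the two in line.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def publication_reset_statements(publication, published_tables, shape_tables):
    """Return ALTER PUBLICATION statements that make the publication match
    the loaded shapes. Tables are (schema, table) tuples."""
    published, wanted = set(published_tables), set(shape_tables)
    pub = quote_ident(publication)

    def table_list(tables):
        return ", ".join(f"{quote_ident(s)}.{quote_ident(t)}" for s, t in sorted(tables))

    statements = []
    if published - wanted:  # tables no longer needed by any shape
        statements.append(f"ALTER PUBLICATION {pub} DROP TABLE {table_list(published - wanted)}")
    if wanted - published:  # tables needed by shapes but not yet published
        statements.append(f"ALTER PUBLICATION {pub} ADD TABLE {table_list(wanted - published)}")
    return statements
```

When storage is lost and no shapes are loaded, `shape_tables` is empty, so every published table ends up in the `DROP TABLE` statement.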

icehaunter avatar Jun 09 '25 09:06 icehaunter

Yeah, a lot has changed since this was originally reported.

alco avatar Jun 09 '25 09:06 alco