posthog icon indicating copy to clipboard operation
posthog copied to clipboard

Integrity issue - PostgreSQL and ClickHouse out of sync

Open guidoiaquinti opened this issue 3 years ago • 2 comments

Bug description

PostHog uses two main databases:

  • PostgreSQL

  • ClickHouse

Some functionalities of PostHog rely on having a subset of data in sync between the two (e.g. person properties). We currently try to archive this by double writing to the two datastores directly in the app. Unfortunately, the datasets are far from being in sync as we use nothing like 2 phase commits that would rollback a transaction entirely if one of the write operation fails.

This data drift is creating issues at the application level and in the overall data integrity. I’m creating this issue to keep track of proposals to address the problem.

Environment

  • [X] PostHog Cloud

  • [X] self-hosted PostHog (ClickHouse-based), version/commit: please provide

  • [X] self-hosted PostHog (Postgres-based, legacy), version/commit: please provide

Additional context

  • https://github.com/PostHog/posthog/issues/10058

guidoiaquinti avatar Jun 10 '22 09:06 guidoiaquinti

My proposal is to implement a Change Data Capture process (see as example the section A Common CDC Architecture with Debezium)

guidoiaquinti avatar Jun 10 '22 09:06 guidoiaquinti

This was potentially the cause of https://posthog.slack.com/archives/C0374DA782U/p1658758589681689

tiina303 avatar Jul 25 '22 16:07 tiina303