
insert_overwrite strategy can miss data on node failover

cwurm opened this issue 1 year ago · 2 comments

When using the insert_overwrite strategy, it's possible that not all the statements run on the same node, leading to some data not being propagated.

The sequence of operations in https://github.com/ClickHouse/dbt-clickhouse/blob/57db2b2c87243273c44a1e8abd96a17ae3ebf806/dbt/include/clickhouse/macros/materializations/incremental/incremental.sql is:

  1. Create new data table
  2. Insert-select to new data table
  3. Create temporary table
  4. Insert-select into temporary table from new data table
  5. Find partition_id in temporary table by querying system.parts
  6. ALTER TABLE main_table REPLACE PARTITION ID partition_id FROM <temporary table>
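The sequence above can be sketched as plain SQL. All database, table, and partition names here are illustrative, not the exact identifiers the macro generates:

```sql
-- 1. + 2. Create the new data table and populate it with the model's SELECT
CREATE TABLE db.model__dbt_new_data AS db.model;
INSERT INTO db.model__dbt_new_data SELECT * FROM ( /* model SQL */ );

-- 3. + 4. Create a temporary table and copy the new data into it
CREATE TABLE db.model__dbt_tmp AS db.model;
INSERT INTO db.model__dbt_tmp SELECT * FROM db.model__dbt_new_data;

-- 5. Discover the affected partition(s) from local part metadata
SELECT DISTINCT partition_id
FROM system.parts
WHERE database = 'db' AND table = 'model__dbt_tmp' AND active;

-- 6. Swap each partition into the target table
ALTER TABLE db.model REPLACE PARTITION ID '<partition_id>' FROM db.model__dbt_tmp;
```

The key point is that step 5 reads `system.parts`, which only reflects the parts the *current* node knows about, while step 6 acts on whatever node it happens to land on.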

If the node changes between steps 5 and 6 because of a failover, the REPLACE PARTITION can execute without moving any data since not all nodes will immediately know about all the new data inserted in step 4.

ClickHouse will not throw an error when replacing a partition with no parts (not a good default, and something we should fix as well). In the logs, this shows up as a line containing `REPLACE cloning 0 parts for partition` in `system.text_log` for the REPLACE PARTITION query.

One way to address this would be to run a SYSTEM SYNC REPLICA ON CLUSTER <cluster> <db>.<table> LIGHTWEIGHT after step 4 to make sure all nodes have all the information about the new parts.
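Concretely, the suggested guard would run right after step 4, before `system.parts` is queried. Cluster and table names here are illustrative:

```sql
-- Wait until every replica has pulled the replication-queue metadata for
-- the temporary table, so REPLACE PARTITION can see the freshly inserted
-- parts regardless of which node executes it. LIGHTWEIGHT only waits for
-- queue metadata (not for fetches/merges to complete), keeping the wait short.
SYSTEM SYNC REPLICA ON CLUSTER 'my_cluster' db.model__dbt_tmp LIGHTWEIGHT;
```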

cwurm avatar Feb 10 '25 14:02 cwurm

A suggested solution is currently under review in #394. We'll review it in the coming days.

BentsiLeviav avatar Feb 10 '25 15:02 BentsiLeviav

Hi, great that you caught this!

As I see it, this is a broader issue in dbt-clickhouse. In all incremental models, the process first creates intermediate_relation/new_data_relation (example), where data is written, and then inserts it into the target table (with or without a delete).

If you have a setup with 2+ replicas and a load balancer distributing queries across them, there's a risk that:

  1. intermediate_relation is written to replica_1
  2. The second query, which transfers data to the target table, reads from replica_2 before the data has been replicated.

A possible solution is to enforce `insert_quorum=<number_of_replicas>` for all inserts in dbt-clickhouse. However, while this improves consistency, it reduces availability: if a single replica goes down, every insert fails until it comes back. The same trade-off applies to SYSTEM SYNC REPLICA, since it forces the query to wait for synchronization and can stall if any replica is lagging or unavailable.
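As a sketch of the quorum approach (the quorum value must match the shard's replica count, and the table name is illustrative):

```sql
-- The insert is only acknowledged once 2 replicas have the data, so a
-- subsequent read from either replica sees it. With a lagging replica,
-- the insert blocks up to insert_quorum_timeout and then fails.
INSERT INTO db.model__dbt_new_data
SELECT * FROM ( /* model SQL */ )
SETTINGS insert_quorum = 2;
```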

I think this should be a user-defined setting in profiles.yml, allowing users to decide on this trade-off themselves.
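One possible shape for this, assuming the existing `custom_settings` mapping in the dbt-clickhouse profile (which passes session-level ClickHouse settings) could carry it; all values below are illustrative:

```yaml
# profiles.yml sketch: opt in to quorum writes per target.
my_profile:
  target: prod
  outputs:
    prod:
      type: clickhouse
      host: clickhouse.example.com
      schema: analytics
      custom_settings:
        insert_quorum: 2   # user-chosen consistency/availability trade-off
```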

pheepa avatar Feb 11 '25 15:02 pheepa