insert_overwrite strategy can miss data on node failover
When using the insert_overwrite strategy, it is possible that not all of the statements run on the same node, so some of the newly written data never makes it into the target table.
The sequence of operations in https://github.com/ClickHouse/dbt-clickhouse/blob/57db2b2c87243273c44a1e8abd96a17ae3ebf806/dbt/include/clickhouse/macros/materializations/incremental/incremental.sql is (sketched in SQL after the list):
1. Create new data table
2. Insert-select into the new data table
3. Create temporary table
4. Insert-select into the temporary table from the new data table
5. Find the partition_id of the temporary table by querying system.parts
6. `ALTER TABLE main_table REPLACE PARTITION ID partition_id FROM <temporary table>`
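Roughly, the statements look like the sketch below. Table names, schema, and engine are illustrative placeholders, not what the macro literally generates:

```sql
-- Illustrative only: identifiers and DDL are placeholders, not the exact
-- statements the incremental macro produces.
CREATE TABLE db.my_model__dbt_new_data (id UInt64, value String, day Date)
    ENGINE = MergeTree PARTITION BY toYYYYMMDD(day) ORDER BY id;        -- step 1
INSERT INTO db.my_model__dbt_new_data
    SELECT id, value, day FROM db.staging WHERE day >= today() - 1;     -- step 2
CREATE TABLE db.my_model__dbt_tmp (id UInt64, value String, day Date)
    ENGINE = MergeTree PARTITION BY toYYYYMMDD(day) ORDER BY id;        -- step 3
INSERT INTO db.my_model__dbt_tmp
    SELECT id, value, day FROM db.my_model__dbt_new_data;               -- step 4
SELECT DISTINCT partition_id
    FROM system.parts
    WHERE database = 'db' AND table = 'my_model__dbt_tmp' AND active;   -- step 5
ALTER TABLE db.my_model
    REPLACE PARTITION ID '20240101' FROM db.my_model__dbt_tmp;          -- step 6
```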
If the node changes between steps 5 and 6 because of a failover, the REPLACE PARTITION can execute without moving any data since not all nodes will immediately know about all the new data inserted in step 4.
ClickHouse does not throw an error when replacing a partition with no parts (not a good default, and something we should fix as well). In the logs this shows up as a line containing `REPLACE cloning 0 parts for partition` in system.text_log for the REPLACE PARTITION query.
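If you suspect this has happened, a query along these lines against system.text_log should surface the silent replace (assuming text_log is enabled on the server):

```sql
-- Find REPLACE PARTITION queries that cloned zero parts.
SELECT event_time, query_id, message
FROM system.text_log
WHERE message LIKE '%cloning 0 parts%'
ORDER BY event_time DESC
LIMIT 20;
```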
One way to address this would be to run `SYSTEM SYNC REPLICA ON CLUSTER <cluster> <db>.<table> LIGHTWEIGHT` after step 4 to make sure all nodes have all the information about the new parts.
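For example (cluster, database, and table names are placeholders):

```sql
-- Wait until every replica has fetched the parts that are already known to
-- exist; LIGHTWEIGHT only waits for fetch-type queue entries, not merges or
-- mutations, so it is cheaper than a full sync.
SYSTEM SYNC REPLICA ON CLUSTER my_cluster my_db.my_model__dbt_new_data LIGHTWEIGHT;
```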
A suggested fix is currently under review in #394. We'll review it in the coming days.
Hi, great that you caught this!
As I see it, this is a broader issue in dbt-clickhouse. In all incremental models, the process first creates an intermediate_relation/new_data_relation (example), writes the data there, and then inserts it into the target table (with or without a delete).
If you have a setup with 2+ replicas and a load balancer distributing queries across them, there's a risk that:
- intermediate_relation is written to replica_1
- The second query, which transfers the data to the target table, reads from replica_2 before the data has been replicated (see the sketch below).
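A minimal sketch of the race, assuming hypothetical table names and a load balancer that routes each statement independently:

```sql
-- Statement 1 is routed to replica_1:
INSERT INTO db.my_model__dbt_new_data
SELECT id, value, day FROM db.source_events WHERE day >= today() - 1;

-- Statement 2 is routed to replica_2 before the parts written above have
-- been replicated, so it copies zero rows into the target table:
INSERT INTO db.my_model
SELECT id, value, day FROM db.my_model__dbt_new_data;
```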
A possible solution is to enforce `insert_quorum=<number_of_nodes>` for all inserts in dbt-clickhouse. However, while this improves consistency, it reduces availability: if one replica goes down, inserts fail across the whole cluster. The same effect applies to `SYSTEM SYNC REPLICA`, as it forces a replica to wait for synchronization, potentially causing downtime if any replica is lagging or unavailable.
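As a sketch, a quorum write would look like this (the quorum value assumes a two-replica shard and is only an example):

```sql
-- Require the insert to be acknowledged by 2 replicas before it succeeds;
-- if fewer than 2 replicas are available, the insert fails rather than
-- silently landing on a single node.
SET insert_quorum = 2;

INSERT INTO db.my_model__dbt_new_data
SELECT id, value, day FROM db.source_events WHERE day >= today() - 1;
```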
I think this should be a user-defined setting in `profiles.yml`, allowing users to decide on this trade-off themselves.
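For instance, if I remember correctly dbt-clickhouse already supports a custom_settings map in the connection profile, so something along these lines could let users opt in (treat the exact key and placement as an assumption; the sketch below is not a confirmed interface):

```yaml
# profiles.yml -- sketch only; whether custom_settings is the right place
# for per-query settings like insert_quorum is exactly the open question.
my_project:
  target: prod
  outputs:
    prod:
      type: clickhouse
      host: clickhouse.example.com
      port: 8443
      user: dbt
      password: "{{ env_var('CLICKHOUSE_PASSWORD') }}"
      custom_settings:
        insert_quorum: 2
```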