Incorrect handling of composite PK with null values
--dbt summary results don't make sense ("Updated Rows: 2" but no columns contain updated values -- this is conflicting).
There is a warning that rows with NULL keys are skipped, but that still doesn't explain the conflict.
(dbt-analytics) ➜ dbt-analytics git:(make-renewal-not-upsell-current-opp) data-diff --dbt --select stg_google_sheets__datafold_customers
Running with data-diff=0.7.12
00:45:59 WARNING NULL values in one or more primary keys of ('ANALYTICS', 'ANALYTICS', 'STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS'). Skipping rows with NULL keys. joindiff_tables.py:262
WARNING NULL values in one or more primary keys of ('DEV', 'DBT_LFOLSOM', 'STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS'). Skipping rows with NULL keys. joindiff_tables.py:262
ANALYTICS.ANALYTICS.STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS <> DEV.DBT_LFOLSOM.STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS
Rows Added Rows Removed
------------ --------------
0 0
Updated Rows: 2
Unchanged Rows: 31
Values Updated:
HUBSPOT_COMPANY_NAME: 0
SHEET_REV_START: 0
SHEET_BOOKED: 0
SHEET_SEQ: 0
SHEET_REV_END: 0
Datafold Cloud seems to handle it correctly due to its PK analysis, if that is helpful context, at least to demonstrate what the actual diff between the tables is.
I think that if --dbt was correctly omitting rows with NULL keys, Updated Rows would be 0.
Hmm we should probably instead have something like :
Updated Rows: 0
Unchanged Rows: 31
Skipped Rows: 2
I think Skipped Rows makes sense in this context. The key is that we should never have a value for Updated Rows be >0 if none of the column-level Values Updated are >0.
I think ideally, we should say something about why they are skipped. Otherwise I would not understand why they were skipped, or without context, what even meant to "skip" a row.
Something like:
Updated Rows: 0
Unchanged Rows: 31
Skipped Rows Due to NULL or Duplicate PKs: 2
I realize it's stated in the WARNING. But less technical users will read past logs and look only at the output.
I believe the output should independently communicate what the user needs to know to understand the results.