data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

Incorrect handling of composite PK with null values

Open leoebfolsom opened this issue 2 years ago • 2 comments

--dbt summary results don't make sense ("Updated Rows: 2" but no columns contain updated values -- this is conflicting).

There is a warning that rows with NULL keys are skipped, but that still doesn't explain the conflict.

(dbt-analytics) ➜  dbt-analytics git:(make-renewal-not-upsell-current-opp) data-diff --dbt --select stg_google_sheets__datafold_customers
Running with data-diff=0.7.12
00:45:59 WARNING  NULL values in one or more primary keys of ('ANALYTICS', 'ANALYTICS', 'STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS'). Skipping rows with NULL keys.                                                     joindiff_tables.py:262
         WARNING  NULL values in one or more primary keys of ('DEV', 'DBT_LFOLSOM', 'STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS'). Skipping rows with NULL keys.                                                         joindiff_tables.py:262

ANALYTICS.ANALYTICS.STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS <> DEV.DBT_LFOLSOM.STG_GOOGLE_SHEETS__DATAFOLD_CUSTOMERS 

  Rows Added    Rows Removed
------------  --------------
           0               0

Updated Rows: 2
Unchanged Rows: 31

Values Updated:
HUBSPOT_COMPANY_NAME: 0
SHEET_REV_START: 0
SHEET_BOOKED: 0
SHEET_SEQ: 0
SHEET_REV_END: 0 

Datafold Cloud seems to handle it correctly due to its PK analysis, if that is helpful context, at least to demonstrate what the actual diff between the tables is.

Screenshot 2023-06-29 at 00 52 18

I think that if --dbt was correctly omitting rows with NULL keys, Updated Rows would be 0.

leoebfolsom avatar Jun 29 '23 07:06 leoebfolsom

Hmm we should probably instead have something like :

Updated Rows: 0
Unchanged Rows: 31
Skipped Rows: 2

dlawin avatar Jul 10 '23 17:07 dlawin

I think Skipped Rows makes sense in this context. The key is that we should never have a value for Updated Rows be >0 if none of the column-level Values Updated are >0.

I think ideally, we should say something about why they are skipped. Otherwise I would not understand why they were skipped, or without context, what even meant to "skip" a row.

Something like:

Updated Rows: 0
Unchanged Rows: 31
Skipped Rows Due to NULL or Duplicate PKs: 2

I realize it's stated in the WARNING. But less technical users will read past logs and look only at the output.

I believe the output should independently communicate what the user needs to know to understand the results.

leoebfolsom avatar Jul 10 '23 18:07 leoebfolsom