data-diff seemingly incorrect "ValueError: Duplicate primary keys" error

Describe the bug

I was testing out the tool on a model that should have essentially zero diff against the production model, just one column modified slightly. It was more a POC to see the tool in action than anything. But I am getting a “ValueError: Duplicate primary keys” error. My table in both prod and in dev passes my unique_key test on the same primary key column that I specified for the data-diff configuration. Can someone help me figure out if I’m doing something wrong here?

This was reported in the dbt Slack: https://getdbt.slack.com/archives/C03D25A92UU/p1682373672200739

I'll request the user add additional info.

Make sure to include the following (minus sensitive information):

The command or code you used
The run output + error you're getting. (including tracestack)
Run data-diff with the -d switch for extra debug information.

If possible, please paste these as text, and not a screenshot.

Describe the environment

Describe which OS you're using, which data-diff version, and any other information that might be relevant to this bug.

Apr 24 '23 23:04 leoebfolsom

This is being caused by a disconnect between dbt's definition of "unique" and data-diff's definition.

Dbt's uniqueness test (note that it excludes nulls):

select
    key as unique_field,
    count(*) as n_records

from DEV.DBT_DEV_DAN.some_table
where key is not null
group by key
having count(*) > 1

Data-diff (nulls are counted):

SELECT count(*) AS "total"
, count(distinct coalesce(cast("ID" as string), '<null>')) AS "total_distinct" 
FROM "INTEGRATION"."BEERS_TEST"."SIMPLE_EXAMPLE"

If "total" != "total_distinct" -> throw a duplicate PK error

Options I see:

Warn that there are null PKs and skip diffing them?
Throw a more verbose error "Detected null PK(s)" or similar

Apr 25 '23 17:04 dlawin

@dlawin

Options I see: Warn that there are null PKs and skip diffing them? Throw a more verbose error "Detected null PK(s)" or similar

I might vote for something like:

"null primary keys detected, skipping records with null keys"

[run]

and include a count of records with null keys in the output

Apr 25 '23 18:04 kylemcnair

@kylemcnair

Options I see: Warn that there are null PKs and skip diffing them? Throw a more verbose error "Detected null PK(s)" or similar

I might vote for something like:

"null primary keys detected, skipping records with null keys"

[run]

and include a count of records with null keys in the output

My only hesitation with that is modifying the existing behavior for non --dbt joindiffs

Probably fairly edge case, but I think we should add a error_on_null boolean to joindiff so that we can leave it as is outside of --dbt

Apr 25 '23 20:04 dlawin

FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

Apr 26 '23 23:04 mariahjrogers

FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

@mariahjrogers That makes sense to me and I think I agree. I'd rather get some data diffed (with a warning about my nulls) than have the whole thing fail.

@dlawin I agree about making this --dbt specific

Apr 27 '23 16:04 kylemcnair

Have a couple users reporting this isn't working for compound keys

Aug 28 '23 16:08 dlawin