seemingly incorrect "ValueError: Duplicate primary keys" error
Describe the bug
I was testing out the tool on a model that should have essentially zero diff against the production model, just one column modified slightly. It was more a POC to see the tool in action than anything. But I am getting a “ValueError: Duplicate primary keys” error. My table in both prod and in dev passes my unique_key test on the same primary key column that I specified for the data-diff configuration. Can someone help me figure out if I’m doing something wrong here?
This was reported in the dbt Slack: https://getdbt.slack.com/archives/C03D25A92UU/p1682373672200739
I'll request the user add additional info.
Make sure to include the following (minus sensitive information):
- The command or code you used
- The run output + error you're getting. (including tracestack)
- Run data-diff with the
-dswitch for extra debug information.
If possible, please paste these as text, and not a screenshot.
Describe the environment
Describe which OS you're using, which data-diff version, and any other information that might be relevant to this bug.
This is being caused by a disconnect between dbt's definition of "unique" and data-diff's definition.
Dbt's uniqueness test (note that it excludes nulls):
select
key as unique_field,
count(*) as n_records
from DEV.DBT_DEV_DAN.some_table
where key is not null
group by key
having count(*) > 1
Data-diff (nulls are counted):
SELECT count(*) AS "total"
, count(distinct coalesce(cast("ID" as string), '<null>')) AS "total_distinct"
FROM "INTEGRATION"."BEERS_TEST"."SIMPLE_EXAMPLE"
If "total" != "total_distinct" -> throw a duplicate PK error
Options I see:
- Warn that there are null PKs and skip diffing them?
- Throw a more verbose error "Detected null PK(s)" or similar
@dlawin
Options I see: Warn that there are null PKs and skip diffing them? Throw a more verbose error "Detected null PK(s)" or similar
I might vote for something like:
"null primary keys detected, skipping records with null keys"
[run]
and include a count of records with null keys in the output
@kylemcnair
Options I see: Warn that there are null PKs and skip diffing them? Throw a more verbose error "Detected null PK(s)" or similar
I might vote for something like:
"null primary keys detected, skipping records with null keys"
[run]and include a count of records with null keys in the output
My only hesitation with that is modifying the existing behavior for non --dbt joindiffs
Probably fairly edge case, but I think we should add a error_on_null boolean to joindiff so that we can leave it as is outside of --dbt
FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!
FWIW in the
--dbtcase I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!
@mariahjrogers That makes sense to me and I think I agree. I'd rather get some data diffed (with a warning about my nulls) than have the whole thing fail.
@dlawin I agree about making this --dbt specific
Have a couple users reporting this isn't working for compound keys