Support key columns of any data type
Is your feature request related to a problem? Please describe.
I am trying to run a diff on a table without one single primary key column, so I'm passing in multiple key columns. Some of these columns are numeric, some are timestamps, and some are VARCHARs. I get an error thrown by the tool when it encounters the TIMESTAMP column: ERROR - Cannot use a column of type Timestamp(precision=6, rounds=True) as a key. I would happily pass it in as a VARCHAR, but I can't modify the type of the field in the table directly, so I tried passing <tstamp_field>::VARCHAR as the key column value, and then got this error: ERROR - Column 'date_day::varchar' not found in table 1.
Describe the solution you'd like Data-diff should support key columns of any type and handle different types behind the scenes.
Failing that, it would be great to be able to pass in a column with a type cast and not get a "column not found" error.
Describe alternatives you've considered One alternative to be able to run my diff with non-VARCHAR key columns would be for me to change the types of the columns in my production table, which isn't an acceptable alternative.
Another alternative would be to generate a new column which is a composite key, but this also isn't an acceptable alternative because it requires first modifying the production version of the table before being able to run a diff against it, when the whole point is that I want to diff my changes before making any impact on the production table.
Additional context
Hi I am using data-diff to mainly diff tables between a production database and a test database. In this case the scenario is an ERP system which has tables with following structures.
objid , objkey, name (Text) , description (Text)
So ofcourse the objid values are different between the two databases but the Name column is a column which is unique for the system.
So ideally I would like to use the name column as my key column so that I can compare the tables between the two systems.
Currently this is not possible, but the system guarantees that the name column really acts like a key column even though it is just a Text() column.
So what would be a good approach to solve this and diff the tables by "forcing" data-diff to treat the column as an id so somehow overrule it to say as a user, yes, I can assure you that this column is a key column?
Is there anything particular you need with regards to key columns that you do not accept TEXT() as a key column?
Thanks!
@mariahjrogers @koenvb which database providers are you encountering this with?
@dlawin For my use case I am using the oracle connection. I fetch it straight from the erp system.
I just started using data-diff to compare data across Postgres databases. I have the same issue as above in that the unique key on some tables is one or more varchar columns. data-diff gives an error "Cannot use a column of type Text() as a key" This is not an uncommon scenario, even if the table has a surrogate id key, but the comparison needs to be on textual data. How can this be done with data-diff?
Context on why this is not supported yet:
We currently support certain strings as PKs. They need to be alphanums with a fixed length (and only these characters)
-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The reason for this is that we need to be able to perform arithmetic operations on the keys in order to segment and bisect the tables when the database is of type ThreadedDatabase. It’s possible but not trivial to implement this for arbitrary types
For fixed length alphanums, you can see we convert them to numbers in order to support them as keys https://github.com/datafold/data-diff/blob/312a9c564b0db7e8825159b4c62c36164fa9e448/data_diff/utils.py#L181
Hi all!
I'm sorry for the delay in following up on this. @mariahjrogers thank you for taking the time to raise this issue!
We made a hard decision to sunset the data-diff package and won't provide further development or support.
If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package's diffing algorithm.
-Gleb