SDV icon indicating copy to clipboard operation
SDV copied to clipboard

Support Nullable foreign keys

Open dfosticupgrade opened this issue 2 years ago • 7 comments

Problem Description

i have the following scenario, two tables Parent(id, name, ….) & Child(id, parent_id, name ….) both tables have id as primary key table Child have parent_id as a foreign key referencing Parent id, but Child’s parent_id can also be null i have following configuration:

metadata.update_column(table_name='parent_table', column_name='id', sdtype='id')
metadata.update_column(table_name='child_table', column_name='id', sdtype='id')
metadata.update_column(table_name='child_table', column_name='primary_parent_id', sdtype='id')

metadata.add_relationship(
    parent_table_name='parent_table',
    child_table_name='child_table',
    parent_primary_key='id',
    child_foreign_key='primary_parent_id'
)

when i try to train synthesizer i get the error

Relationships:
Error: foreign key column 'primary_parent_id' contains unknown references: (nan). All the values in this column must reference a primary key.

Expected behaviour

Child table should be able to have null values in foreign key table

Additional context

https://sdv-space.slack.com/archives/C01HKBP7V08/p1698247849922839

dfosticupgrade avatar Oct 26 '23 12:10 dfosticupgrade

Thank you for filing the issue @dfosticupgrade. The SDV does not currently support nullable foreign keys, so we'll make sure to leave this as a feature request and use it to track any updates.

Workaround

One workaround you can use is to create a dummy parent that represents the missing values. This would go something like this:

  • Inject 1 row in the parent table with a special primary key, eg 9999, and average values in the rest of the columns.
  • Then in the child table, any null foreign key should instead be linked to the dummy parent 9999

Then, there will be no null foreign keys and the SDV will be able to model this data. (Of course, the SDV is not designed to create any synthetic data with null foreign keys either, which is a drawback of this workaround.)

npatki avatar Oct 26 '23 13:10 npatki

Thank you @npatki! you approach works but i run into the following issue: now all records generated don't take in account original id, they generate some random numbers, which will not exists in the child, correlation is broken

dfosticupgrade avatar Oct 27 '23 18:10 dfosticupgrade

@dfosticupgrade, could you share your metadata as well as some examples/screenshots of the synthetic data? If something is specified as a foreign key in the metadata, then the synthetic child should always refer to a synthetic parent.

npatki avatar Oct 30 '23 18:10 npatki

A nullable foreign key is a foreign key that can have a value of NULL. This means that the column can contain missing or null values, indicating that the related record does not exist in the referenced table. In other words, it allows for "gaps" in the data, where there are no related records in the referenced table.

A classical situation is the employee table, where the manager field is a foreign key to the same table. But the top managers field is empty in this case.

Is it possible, that you enable the support of nullable foreign keys? Without this feature working with relational data is not possible.

AdventureWorks dataset as CSV, source: https://github.com/olafusimichael/AdventureWorksCSV

gszecsenyi avatar Nov 07 '23 20:11 gszecsenyi

Hi @gszecsenyi thanks for providing such a compelling example for this. Nullable foreign keys are officially supported by SQL and I agree that this would be a good feature to add.

Do note that this request may take longer for us to fulfill, since it likely requires some core algorithmic changes to the synthesizers (i.e. it is not a surface-level bug fix).

In the meantime, do feel free to use the workaround mentioned in my earlier comment to unblock your project and let us know if you have an questions.

npatki avatar Nov 14 '23 15:11 npatki

@npatki you approach works, but the cardinality plot i am getting looks like his

image

dfosticupgrade avatar Nov 14 '23 19:11 dfosticupgrade

Hi @dfosticupgrade, congrats on getting synthetic data!

The particular quality issue you're encountering is being discussed in #1673, so we can further discuss there.

npatki avatar Nov 17 '23 14:11 npatki

Hi all -- the team is actively working to support nullable foreign keys. Since we have multiple models, we've created a dedicated issues for HMASynthesizer in #2063. You can follow along to monitor progress and see updates.

I'm closing this one off as a duplicate of the above. Though if there's something I'm missing, please free free to comment below and I can always re-open this issue to investigate. Thanks.

npatki avatar Jun 13 '24 15:06 npatki