Support Nullable foreign keys
Problem Description
i have the following scenario, two tables Parent(id, name, ….) & Child(id, parent_id, name ….) both tables have id as primary key table Child have parent_id as a foreign key referencing Parent id, but Child’s parent_id can also be null i have following configuration:
metadata.update_column(table_name='parent_table', column_name='id', sdtype='id')
metadata.update_column(table_name='child_table', column_name='id', sdtype='id')
metadata.update_column(table_name='child_table', column_name='primary_parent_id', sdtype='id')
metadata.add_relationship(
parent_table_name='parent_table',
child_table_name='child_table',
parent_primary_key='id',
child_foreign_key='primary_parent_id'
)
when i try to train synthesizer i get the error
Relationships:
Error: foreign key column 'primary_parent_id' contains unknown references: (nan). All the values in this column must reference a primary key.
Expected behaviour
Child table should be able to have null values in foreign key table
Additional context
https://sdv-space.slack.com/archives/C01HKBP7V08/p1698247849922839
Thank you for filing the issue @dfosticupgrade. The SDV does not currently support nullable foreign keys, so we'll make sure to leave this as a feature request and use it to track any updates.
Workaround
One workaround you can use is to create a dummy parent that represents the missing values. This would go something like this:
- Inject 1 row in the parent table with a special primary key, eg
9999, and average values in the rest of the columns. - Then in the child table, any null foreign key should instead be linked to the dummy parent
9999
Then, there will be no null foreign keys and the SDV will be able to model this data. (Of course, the SDV is not designed to create any synthetic data with null foreign keys either, which is a drawback of this workaround.)
Thank you @npatki! you approach works but i run into the following issue: now all records generated don't take in account original id, they generate some random numbers, which will not exists in the child, correlation is broken
@dfosticupgrade, could you share your metadata as well as some examples/screenshots of the synthetic data? If something is specified as a foreign key in the metadata, then the synthetic child should always refer to a synthetic parent.
A nullable foreign key is a foreign key that can have a value of NULL. This means that the column can contain missing or null values, indicating that the related record does not exist in the referenced table. In other words, it allows for "gaps" in the data, where there are no related records in the referenced table.
A classical situation is the employee table, where the manager field is a foreign key to the same table. But the top managers field is empty in this case.
Is it possible, that you enable the support of nullable foreign keys? Without this feature working with relational data is not possible.
AdventureWorks dataset as CSV, source: https://github.com/olafusimichael/AdventureWorksCSV
Hi @gszecsenyi thanks for providing such a compelling example for this. Nullable foreign keys are officially supported by SQL and I agree that this would be a good feature to add.
Do note that this request may take longer for us to fulfill, since it likely requires some core algorithmic changes to the synthesizers (i.e. it is not a surface-level bug fix).
In the meantime, do feel free to use the workaround mentioned in my earlier comment to unblock your project and let us know if you have an questions.
@npatki you approach works, but the cardinality plot i am getting looks like his
Hi @dfosticupgrade, congrats on getting synthetic data!
The particular quality issue you're encountering is being discussed in #1673, so we can further discuss there.
Hi all -- the team is actively working to support nullable foreign keys. Since we have multiple models, we've created a dedicated issues for HMASynthesizer in #2063. You can follow along to monitor progress and see updates.
I'm closing this one off as a duplicate of the above. Though if there's something I'm missing, please free free to comment below and I can always re-open this issue to investigate. Thanks.