SDV
Add utility function to simplify multi-table schemas
Problem Description
Currently, HMA cannot run on certain multi-table schemas. We issue a warning when a schema will generate too many columns, and we should provide a utility function to easily reduce a multi-table schema so it can successfully run on HMA.
Expected behavior
Add a new utility function utils.simplify_schema:
Parameters:
- data - the data dictionary
- metadata - the MultiTableMetadata for this dataset
Returns:
- A data dictionary mapping table names to simplified tables
- MultiTableMetadata for the simplified data schema
from sdv.utils import simplify_schema

simple_data, simple_metadata = simplify_schema(
    data=my_data,
    metadata=my_metadata
)
Algorithm overview
For every root table:
- drop any table that is more than 2 levels below the root (i.e. keep only direct children and grandchildren) and count the number of tables still connected to the root
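A rough sketch of this step and of the root selection that follows it. It assumes relationships are given as a list of dicts with `parent_table_name` and `child_table_name` keys; the helper names `tables_within_depth` and `select_root` are hypothetical and only for illustration, not part of SDV.

```python
from collections import defaultdict


def tables_within_depth(root, relationships, max_depth=2):
    """Return the descendant tables at most `max_depth` levels below `root`."""
    # Map each parent table to the set of its direct children.
    children = defaultdict(set)
    for rel in relationships:
        children[rel['parent_table_name']].add(rel['child_table_name'])

    kept = set()
    frontier = {root}
    # Breadth-first walk, one level per iteration.
    for _ in range(max_depth):
        frontier = {c for t in frontier for c in children[t]} - kept - {root}
        kept |= frontier
    return kept


def select_root(table_names, relationships, max_depth=2):
    """Pick the root table with the most descendants within `max_depth`."""
    # A root is any table that never appears as a child.
    child_tables = {rel['child_table_name'] for rel in relationships}
    roots = [t for t in table_names if t not in child_tables]
    # Ties go to the first root in `table_names`; raises ValueError if no root exists.
    return max(
        roots,
        key=lambda r: len(tables_within_depth(r, relationships, max_depth)),
    )
```

Every table that is neither the selected root nor in its `tables_within_depth` result would be dropped from both the data and the metadata.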
Select the root with the greatest number of descendant tables
Calculate the number of extended columns we can add to the root (we can reuse the logic used to generate the warning in HMA)
Allocate a # of augmented columns to each child relationship
For each child:
- Determine the number of modelable columns and add the number of child relationships for that child
- If the number of modelable columns will generate more than the allowed number of extended columns, drop modelable columns from the child
  - Try to keep a variety of sdtypes among the columns that remain
- If dropping columns is not enough to stay within the maximum number of extended columns, drop grandchild tables until it is
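One possible way to drop columns from a child while keeping a variety of sdtypes is to pick columns round-robin across sdtypes. This is only a sketch: the `MODELABLE_SDTYPES` tuple, the `choose_columns_to_keep` helper, and the `max_columns` budget (which would come from the extended-column calculation above) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import zip_longest

# Assumption: the sdtypes treated as modelable. Keys and other sdtypes are ignored here.
MODELABLE_SDTYPES = ('numerical', 'categorical', 'datetime', 'boolean')


def choose_columns_to_keep(column_sdtypes, max_columns):
    """Return up to `max_columns` modelable column names, spread across sdtypes.

    `column_sdtypes` maps column name -> sdtype string.
    """
    # Group modelable columns by sdtype, preserving their original order.
    by_sdtype = defaultdict(list)
    for name, sdtype in column_sdtypes.items():
        if sdtype in MODELABLE_SDTYPES:
            by_sdtype[sdtype].append(name)

    # Interleave the groups: first column of each sdtype, then the second, and so on.
    interleaved = [
        name
        for group in zip_longest(*by_sdtype.values())
        for name in group
        if name is not None
    ]
    return interleaved[:max(max_columns, 0)]
```

Modelable columns not returned by this helper would be dropped from the child table. If the budget is still exceeded with no columns left to drop, the next step is dropping grandchild tables.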
For each grandchild:
- Drop all modelable columns (grandchildren should only generate a num_rows column in their parents)
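For the grandchild step, a minimal sketch of stripping modelable columns from a table's column metadata while leaving keys in place. It assumes the columns are available as a dict mapping column name to a spec dict with an `sdtype` entry; `MODELABLE_SDTYPES` and `strip_modelable_columns` are hypothetical names.

```python
# Assumption: the sdtypes treated as modelable.
MODELABLE_SDTYPES = ('numerical', 'categorical', 'datetime', 'boolean')


def strip_modelable_columns(columns):
    """Split column specs into (kept, dropped) for a grandchild table.

    `columns` maps column name -> spec dict such as {'sdtype': 'numerical'}.
    Returns the specs to keep and the list of dropped column names, so the
    same names can be removed from the table's data.
    """
    kept = {}
    dropped = []
    for name, spec in columns.items():
        if spec.get('sdtype') in MODELABLE_SDTYPES:
            dropped.append(name)
        else:
            # Primary and foreign keys (and any other non-modelable sdtype) stay.
            kept[name] = spec
    return kept, dropped
```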
Additional context
We should also change the warning in HMA to point to this utility function:
>>> synthesizer = HMASynthesizer(metadata)
PerformanceAlert: Using the HMASynthesizer on this metadata schema is not recommended because HMA will generate a large number of columns
Table Name      # Columns in Metadata    Est # Columns
users           12                       123123123
transactions
...
We recommend simplifying your metadata schema using utils.simplify_schema