HMASynthesizer diagnostic score is not 1.0 when using `'truncnorm'` distribution

Open npatki opened this issue 1 year ago • 1 comments

Environment Details

  • SDV version: 1.10.0
  • Python version: (any)
  • Operating System: (any)

Error Description

If I update the default distribution to 'truncnorm', then the HMASynthesizer creates synthetic data that is not completely valid. When running the diagnostic report, the Data Validity score is not 100% -- because there are extra NaN/NaT values that appear in the synthetic data.

Steps to reproduce

Replicate this using the attached metadata and data.

from sdv.datasets.local import load_csvs
from sdv.evaluation.multi_table import run_diagnostic
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

# Load the attached tables and their metadata
data = load_csvs(folder_name='test_data/')
metadata = MultiTableMetadata.load_from_json('test_metadata.json')

# Switch every table's default distribution to 'truncnorm'
synthesizer = HMASynthesizer(metadata)
for table_name in data.keys():
  synthesizer.set_table_parameters(
    table_name=table_name,
    table_parameters={'default_distribution': 'truncnorm'})

synthesizer.fit(data)
synthetic_data = synthesizer.sample()

# The Data Validity score is expected to be 100%, but it is not
diagnostic_report = run_diagnostic(
  real_data=data, synthetic_data=synthetic_data, metadata=metadata)

test_data.zip test_metadata.json

OUTPUT: At first, you'll see many warnings originating from the truncated Gaussian distribution during modeling:

site-packages/copulas/univariate/truncated_gaussian.py:45: RuntimeWarning: invalid value encountered in scalar divide
site-packages/copulas/univariate/truncated_gaussian.py:46: RuntimeWarning: divide by zero encountered in scalar divide

Then during sampling, there are more warnings that the transformed data (coming directly from the ML models) contains null values, so the final synthetic data (after the reverse transform) will also contain null values.

site-packages/rdt/transformers/utils.py:217: UserWarning: There are null values in the transformed data. The reversed transformed data will contain null values.

Finally, the diagnostic is not 100%:

Overall Score: 94.67%

Properties:
- Data Validity: 84.01%
- Data Structure: 100.0%
- Relationship Validity: 100.0%
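
To see which columns account for the missing Data Validity points, one option is to compare null counts between the real and synthetic tables. The sketch below is only an illustration: `extra_null_counts` is a hypothetical helper (not part of SDV), and the toy `users` table stands in for the attached data. It assumes both inputs are dictionaries mapping table names to pandas DataFrames, which is the shape that `load_csvs` and `sample()` return in the reproduction above.

```python
import pandas as pd


def extra_null_counts(real_data, synthetic_data):
    """Report columns that have nulls in the synthetic data but none in the real data.

    Hypothetical helper for illustration; both arguments are assumed to be
    dicts of {table_name: DataFrame}.
    """
    report = {}
    for table_name, real_table in real_data.items():
        synthetic_table = synthetic_data[table_name]
        # Only compare columns present in both tables
        shared = [c for c in real_table.columns if c in synthetic_table.columns]
        real_nulls = real_table[shared].isna().sum()
        synthetic_nulls = synthetic_table[shared].isna().sum()
        # Keep columns where nulls appear only on the synthetic side
        unexpected = synthetic_nulls[(real_nulls == 0) & (synthetic_nulls > 0)]
        if not unexpected.empty:
            report[table_name] = {col: int(n) for col, n in unexpected.items()}
    return report


# Toy stand-in data: 'age' has no nulls in the real table but two in the synthetic one
real = {'users': pd.DataFrame({'age': [30, 41, 25], 'score': [1.0, None, 3.0]})}
synthetic = {'users': pd.DataFrame({'age': [28.0, None, None], 'score': [2.0, None, 1.5]})}

toy_report = extra_null_counts(real, synthetic)
print(toy_report)
```

With the real reproduction, the call would be `extra_null_counts(data, synthetic_data)`. Columns that already contain nulls in the real data are skipped on purpose, since nulls there are expected.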

Additional Context

This was first observed in #1755

npatki avatar Mar 04 '24 21:03 npatki

@npatki FYI this appears to happen whenever the sampled `a` value is greater than the sampled `b` value

frances-h avatar Mar 05 '24 22:03 frances-h
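
A minimal sketch of the observation above, calling SciPy's `truncnorm` directly. It assumes that the copulas truncated Gaussian ultimately relies on this SciPy distribution, which this thread does not confirm. The parameter values and the `has_valid_bounds` helper are made up for illustration. When the lower bound `a` is greater than the upper bound `b`, SciPy treats the shape parameters as invalid and the percent point function returns NaN, which would be consistent with the null values reported during sampling.

```python
import math

from scipy.stats import truncnorm

# Valid bounds (a < b): the median of a symmetric truncated normal is finite
valid_value = float(truncnorm.ppf(0.5, -1.0, 1.0))

# Inverted bounds (a > b): SciPy rejects the shape parameters and returns NaN
invalid_value = float(truncnorm.ppf(0.5, 1.0, -1.0))

print('a < b ->', valid_value)
print('a > b ->', invalid_value)


def has_valid_bounds(a, b):
    """Hypothetical guard: truncnorm requires the lower bound to be below the upper bound."""
    return a < b
```

A guard like `has_valid_bounds` could be applied to sampled parameters before they are used, although where such a check belongs in SDV or copulas is left open here.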