Add update_transformers to synthesizers

Open amontanez24 opened this issue 3 years ago • 2 comments

Problem Description

As a user, it would be helpful to have ways to manually set custom transformers to use on my data before modeling.

Expected behavior

  • Add update_transformers method to BaseSynthesizer
  • Parameters:
    • column_name_to_transformer (dict): A dictionary mapping the name of the column to the transformer instance.
  • The method should update the HyperTransformer based on the provided dict.
  • Validation:
    • Errors: (Raise if one or more columns hit any of these cases. Run all the checks first; we shouldn't partially update anything.)
      • Updating a transformer that is incompatible with the sdtype provided in the metadata. Error: "Column 'age' is a numerical column, which is incompatible with the 'LabelEncoder' preprocessing."
      • Adding a transformer other than AnonymizedFaker or RegexGenerator for a key column (primary, alternate, or sequence key). Error: "Column 'user_id' is a key. It cannot be preprocessed using the 'FloatFormatter' transformer."
      • Assigning a transformer object that has already been fit. Error: "Transformer for column 'age' has already been fit on data."
    • Warnings: Raise all that arise.
      • (CTGAN, CopulaGAN, TVAE, PAR only): Whenever the user tries to add a transformer for a column that is auto-assigned to None (boolean/categorical). Warning: "Replacing the default transformer for column 'degree_type' might impact the quality of your synthetic data."
      • (GaussianCopula): Whenever the user adds a OneHotEncoder to a categorical column. Warning: "Using the OneHotEncoder for column 'degree_type' may slow down the preprocessing and modeling time."
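
The validate-first, update-after flow described above could be sketched roughly as follows. This is only an illustration, not the SDV implementation: `StubTransformer`, `InvalidTransformerError`, the `COMPATIBLE` table, and the standalone function signature are all hypothetical stand-ins, and the choice to skip the sdtype check for key columns and to emit warnings only when no error is raised is an assumption.

```python
import warnings
from dataclasses import dataclass


class InvalidTransformerError(Exception):
    """Hypothetical error type; the issue does not name one."""


@dataclass
class StubTransformer:
    """Stand-in for a transformer instance (not the real RDT class)."""
    name: str
    fitted: bool = False


# Assumed compatibility table, for illustration only.
COMPATIBLE = {
    'numerical': {'FloatFormatter'},
    'categorical': {'LabelEncoder', 'OneHotEncoder'},
    'boolean': {'LabelEncoder'},
}
KEY_TRANSFORMERS = {'AnonymizedFaker', 'RegexGenerator'}
SKIPS_CATEGORICAL = {'CTGAN', 'CopulaGAN', 'TVAE', 'PAR'}


def update_transformers(current, column_name_to_transformer, sdtypes,
                        key_columns, synthesizer):
    """Check every column first, then update `current` in place."""
    errors, pending_warnings = [], []
    for column, transformer in column_name_to_transformer.items():
        name, sdtype = transformer.name, sdtypes[column]
        if column in key_columns:
            if name not in KEY_TRANSFORMERS:
                errors.append(
                    f"Column '{column}' is a key. It cannot be preprocessed "
                    f"using the '{name}' transformer.")
        elif name not in COMPATIBLE.get(sdtype, set()):
            errors.append(
                f"Column '{column}' is a {sdtype} column, which is "
                f"incompatible with the '{name}' preprocessing.")
        if transformer.fitted:
            errors.append(
                f"Transformer for column '{column}' has already been fit "
                "on data.")
        # Warning cases: collected here, emitted only if nothing failed.
        if (synthesizer in SKIPS_CATEGORICAL
                and column in current and current[column] is None):
            pending_warnings.append(
                f"Replacing the default transformer for column '{column}' "
                "might impact the quality of your synthetic data.")
        if (synthesizer == 'GaussianCopula' and name == 'OneHotEncoder'
                and sdtype == 'categorical'):
            pending_warnings.append(
                f"Using the OneHotEncoder for column '{column}' may slow "
                "down the preprocessing and modeling time.")
    if errors:
        # Nothing has been written yet, so no partial update can occur.
        raise InvalidTransformerError('\n'.join(errors))
    for message in pending_warnings:
        warnings.warn(message)
    current.update(column_name_to_transformer)
```

Collecting every error before raising lets the user see all problem columns at once, and deferring `current.update(...)` until after the checks keeps the existing configuration untouched when any check fails.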

amontanez24 avatar Sep 21 '22 06:09 amontanez24

@amontanez24 Could you clarify what you meant by "Whenever the user tries to add a transformer for a column that is auto-assigned to None (boolean/categorical)"?

fealho avatar Oct 05 '22 19:10 fealho

> @amontanez24 Could you clarify what you meant by "Whenever the user tries to add a transformer for a column that is auto-assigned to None (boolean/categorical)"?

For CTGAN, CopulaGAN and TVAE, the categorical and boolean transformations are skipped. Instead of using the default categorical transformer for those columns, we should use None. If a user tries to change that, we raise the warning but let them do it, since it won't technically break anything.

amontanez24 avatar Oct 05 '22 19:10 amontanez24