ClusterBasedNormalizer vs GaussianNormalizer vs PowerTransformer

Open candalfigomoro opened this issue 2 years ago • 1 comment

When using CTGAN, data is normalized using ClusterBasedNormalizer.

In RDT, GaussianNormalizer is also implemented.

What are the advantages of ClusterBasedNormalizer and GaussianNormalizer compared to using sklearn's PowerTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) with the Yeo-Johnson method? Couldn't a power transform be used instead (which would perhaps be faster than ClusterBasedNormalizer)?

Thank you

Feb 13 '23 11:02 candalfigomoro

Hi @candalfigomoro, thanks for the feedback. We'll keep this issue open to share any information as we investigate the specifics of these transformers.

Some considerations:

Quality: Does this transformer significantly improve the quality of the synthetic data it is used to create? To evaluate quality, we use the SDMetrics quality report.
Performance: How quickly is this transformer able to fit, transform and reverse transform compared to the others?
Memory: What would be the overall file size if you were to save a synthesizer that used this transformer vs. others?
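For anyone who wants to explore the performance and memory points, here is a minimal sketch of how one might time sklearn's PowerTransformer and measure its pickled size. The `benchmark` helper and the synthetic bimodal column are made up for illustration and are not part of RDT or SDV. Pickled transformer size is only a rough proxy for saved synthesizer size, and the sketch does not measure quality at all (that would need the SDMetrics quality report on actual synthetic data).

```python
import pickle
import time

import numpy as np
from sklearn.preprocessing import PowerTransformer


def benchmark(transformer, values):
    """Hypothetical helper: time fit / transform / inverse_transform and
    report the pickled size of the fitted transformer."""
    column = values.reshape(-1, 1)  # scikit-learn expects a 2D array

    start = time.perf_counter()
    transformer.fit(column)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    transformed = transformer.transform(column)
    transform_seconds = time.perf_counter() - start

    start = time.perf_counter()
    recovered = transformer.inverse_transform(transformed)
    reverse_seconds = time.perf_counter() - start

    return {
        "fit_seconds": fit_seconds,
        "transform_seconds": transform_seconds,
        "reverse_seconds": reverse_seconds,
        # Rough proxy for the memory question: size of the fitted object.
        "pickled_bytes": len(pickle.dumps(transformer)),
        # Sanity check that the reverse transform recovers the input.
        "max_roundtrip_error": float(np.max(np.abs(recovered.ravel() - values))),
    }


# Synthetic bimodal column (an assumption, chosen only for illustration).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(-5, 1, 5000), rng.normal(10, 2, 5000)])

report = benchmark(PowerTransformer(method="yeo-johnson"), values)
print(report)
```

The same measurements could be taken for ClusterBasedNormalizer and GaussianNormalizer, though the RDT transformers work on DataFrame columns rather than 2D arrays, so the helper would need a small adapter. Note that a power transform is monotonic, so a bimodal column stays bimodal after it; speed alone would not settle the quality question.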

If you have done any exploration yourself along these lines, we'd be very eager to see it!

Mar 29 '23 20:03 npatki