ClusterBasedNormalizer vs GaussianNormalizer vs PowerTransformer
When using CTGAN, data is normalized using ClusterBasedNormalizer.
In RDT, GaussianNormalizer is also implemented.
What are the advantages of ClusterBasedNormalizer and GaussianNormalizer compared to using sklearn's PowerTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) with the Yeo-Johnson method? Couldn't a power transform be used instead (which would perhaps be faster than ClusterBasedNormalizer)?
Thank you
Hi @candalfigomoro, thanks for the feedback. We'll keep this issue open to share any information as we investigate the specifics of these transformers.
Some considerations:
- Quality: Does this significantly improve the quality when used to create synthetic data? To evaluate quality, we use the SDMetrics quality report.
- Performance: How quickly is this transformer able to fit, transform and reverse transform compared to the others?
- Memory: What would be the overall file size if you were to save a synthesizer that used this transformer vs. others?
If you have done any exploration yourself along these lines, we'd be very eager to see it!
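As a starting point for the performance and memory considerations above, here is a minimal sketch of a timing harness. It is an illustration only, not how SDV or RDT benchmark transformers internally. `StandardizerStandIn` and `benchmark` are hypothetical names made up for this example, and the stand-in is a plain standardizer used only so the script runs with the standard library; it is not one of the three transformers being discussed.

```python
import pickle
import random
import time


class StandardizerStandIn:
    """Hypothetical stand-in transformer (not part of RDT or scikit-learn)."""

    def fit(self, values):
        n = len(values)
        self.mean_ = sum(values) / n
        variance = sum((v - self.mean_) ** 2 for v in values) / n
        # Fall back to 1.0 so a constant column does not divide by zero.
        self.std_ = variance ** 0.5 or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, values):
        return [v * self.std_ + self.mean_ for v in values]


def benchmark(transformer, data, reverse_name="inverse_transform"):
    """Time fit / transform / reverse transform and measure the pickled size."""
    start = time.perf_counter()
    transformer.fit(data)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    transformed = transformer.transform(data)
    transform_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = getattr(transformer, reverse_name)(transformed)
    reverse_seconds = time.perf_counter() - start

    return {
        "fit_seconds": fit_seconds,
        "transform_seconds": transform_seconds,
        "reverse_seconds": reverse_seconds,
        # Size of the fitted transformer alone, as a rough proxy for
        # its contribution to a saved synthesizer file.
        "pickled_bytes": len(pickle.dumps(transformer)),
        "max_roundtrip_error": max(abs(a - b) for a, b in zip(data, restored)),
    }


# Skewed synthetic column (log-normal), the kind of data a power
# transform or a mixture-based normalizer is meant to handle.
random.seed(0)
column = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]

report = benchmark(StandardizerStandIn(), column)
for name, value in report.items():
    print(f"{name}: {value}")
```

To compare the real candidates, the stand-in would be swapped for each transformer in turn. sklearn's `PowerTransformer(method="yeo-johnson")` exposes `fit`, `transform` and `inverse_transform`, but expects 2D input, so the column would need reshaping first. The RDT transformers name the last step `reverse_transform` and work on pandas data, so they would likely need a thin adapter around this harness. The quality question is not covered here and would still need the SDMetrics quality report.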