SDV Question regarding CTGAN for data synthesis and classification tasks

I am currently using CTGAN to synthesize data and evaluate its utility for classification tasks. I have a question regarding the observed performance when training classifiers on the generated data.

According to the quality report provided by CTGAN, the overall quality score is around 91%, so the synthesized data should be good for the classification task. However, when I train a classifier using only the real data (with cross-validation), I achieve about 70% accuracy, whereas when I train the classifier on the synthetic data generated by CTGAN and then classify the test set, the accuracy drops to around 50%.

I also experimented by combining the real and synthetic data in the training set, but the performance remains similar to training solely on real data.

With a simpler method such as SMOTE, I achieve far better performance. For example, if I plot the column shapes, I see that for the numerical columns the distribution of the CTGAN data is worse than the SMOTE one.

I would appreciate any insights or suggestions to better understand these observations and improve the classification performance. Thank you for your attention to this matter.

Best regards, Daniele

Jul 17 '23 09:07 danielemolino

I'm attaching two plots so the differences between the two models are clear.

SMOTE: [image: Age_SMOTE]

CTGAN: [image: Age_CTGAN]

Jul 17 '23 09:07 danielemolino

Did you find the reason? I am facing the same issue now

Apr 11 '24 17:04 weeebdev

Hi there @weeebdev and @danielemolino 👋 This is a late reply but hoping it's still useful!

Curious to know the motivation behind using synthetic data if SMOTE is working for you? Maybe that's a better approach in this case?

Quality reports actually report on statistical similarity between synthetic and real data. But good similarity doesn't necessarily imply that the synthetic data is optimal for training ML models.
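
To make the point about similarity versus ML usefulness concrete, here is a minimal toy sketch in plain Python. It does not use SDV or CTGAN; every name in it (`make_real`, `scramble_labels`, `fit_stump`) is hypothetical and only for illustration. The "synthetic" table has exactly the same per-column values as the "real" one, so a check that looks only at per-column distributions would rate it as a perfect match, yet the link between the feature and the label is gone. This is a deliberately extreme case, and a pairwise-correlation check would catch it, but the same kind of gap can appear for relationships that a similarity score does not measure.

```python
def make_real(n, offset=0.0):
    """Toy 'real' table: feature x on an even grid in [0, 1), label 1 when x >= 0.5."""
    return [((i + offset) / n, int((i + offset) / n >= 0.5)) for i in range(n)]


def scramble_labels(rows):
    """Toy 'synthetic' table: same x values, but labels alternate 0, 1, 0, 1, ...

    With an even number of rows and balanced real labels, the label counts
    stay the same while the labels no longer depend on x.
    """
    return [(x, i % 2) for i, (x, _) in enumerate(rows)]


def predict(stump, x):
    """Predict `above` when x exceeds the threshold, otherwise the other class."""
    threshold, above = stump
    return above if x > threshold else 1 - above


def accuracy(stump, rows):
    """Fraction of rows whose label the stump predicts correctly."""
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)


def fit_stump(rows):
    """Pick the threshold and direction with the best training accuracy."""
    best_stump, best_acc = None, -1.0
    for threshold, _ in rows:
        for above in (0, 1):
            acc = accuracy((threshold, above), rows)
            if acc > best_acc:
                best_stump, best_acc = (threshold, above), acc
    return best_stump


real_train = make_real(200)
real_test = make_real(200, offset=0.5)  # held-out real rows on a shifted grid
synthetic_train = scramble_labels(real_train)

# Each column, looked at on its own, is identical in both tables.
same_x = sorted(x for x, _ in real_train) == sorted(x for x, _ in synthetic_train)
same_y = sorted(y for _, y in real_train) == sorted(y for _, y in synthetic_train)

# Train on real vs. train on synthetic, always testing on held-out real rows.
acc_real = accuracy(fit_stump(real_train), real_test)
acc_synth = accuracy(fit_stump(synthetic_train), real_test)

print("identical per-column values:", same_x and same_y)
print("train on real, test on real:", acc_real)
print("train on synthetic, test on real:", acc_synth)
```

The classifier trained on the scrambled table ends up close to chance on real test rows, even though each column matches the real one exactly.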

We've added a lot more bells & whistles to tweak our models that might improve synthetic data quality since July 2023, so I'd encourage you to try them out.

Synthesizer parameters can help you describe certain properties you want the model to follow.

Our Metadata classes help you better provide semantic meaning for your columns for the synthesizer models to use (e.g. which columns represent SSN values, email addresses, or IP addresses). The synthesized data will better adhere to these semantic definitions this way.

As a last resort, Constraints can help you express certain business rules you want the synthetic data to follow.
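
As a rough sketch of the first two options (metadata and synthesizer parameters), something along these lines should work with the SDV 1.x single-table API as it looked around this time. Treat it as an assumption-laden example rather than the recommended setup: the helper name `fit_ctgan_and_sample` and its arguments are hypothetical, the parameter values are arbitrary, newer SDV releases may expose a different metadata class, and constraints are left out because their API depends on the installed version.

```python
def fit_ctgan_and_sample(real_data, num_rows, numerical_columns=(), epochs=300):
    """Fit CTGAN on a pandas DataFrame and return `num_rows` synthetic rows.

    Hypothetical helper: `numerical_columns` lists columns (for example 'Age')
    whose detected type should be overridden to numerical.
    """
    # Imported lazily so this module still loads where SDV is not installed.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    # Auto-detect column types, then correct the ones that matter by hand.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    for column in numerical_columns:
        metadata.update_column(column_name=column, sdtype="numerical")

    # Synthesizer parameters: keep values inside the observed range, keep the
    # original rounding, and control training length via `epochs`.
    synthesizer = CTGANSynthesizer(
        metadata,
        enforce_min_max_values=True,
        enforce_rounding=True,
        epochs=epochs,
    )
    synthesizer.fit(real_data)
    return synthesizer.sample(num_rows=num_rows)
```

It is worth printing the detected metadata before fitting, since a numerical column that is detected as categorical (or the reverse) can noticeably change the shape of the generated distribution.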

May 09 '24 21:05 srinify

As for the choice between SMOTE and synthetic data, SMOTE is limited in the number of rows it can generate (as far as I know), while synthetic data offers more configurable inputs. Also, a SMOTE-resampled dataset contains both synthetic and real rows, which may not be suitable when we want to train only on synthetic data. But I am not a specialist, so I might be wrong.
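
For context on how SMOTE differs from a generative model, here is a simplified plain-Python sketch of its core idea: each new row is an interpolation between a real minority-class row and one of its nearest minority-class neighbours. This is not the imbalanced-learn implementation; `smote_like` and the sample values are hypothetical and only illustrate the mechanism.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new rows by interpolating between minority rows and their neighbours."""
    rng = random.Random(seed)
    new_rows = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment between the two rows
        new_rows.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return new_rows


# Hypothetical minority-class rows: (age, some numeric score).
minority = [(23.0, 1.2), (25.0, 1.0), (31.0, 2.4), (38.0, 2.1), (41.0, 3.0)]
generated = smote_like(minority, n_new=10)
augmented = minority + generated  # the real rows stay in the training set
```

Because every generated value lies between two real values, the per-column distributions stay close to the real ones by construction, which may explain why the SMOTE plot looks closer to the real data than the CTGAN one. It also shows why the result is a mix of real and interpolated rows rather than a fully synthetic table.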

May 11 '24 17:05 weeebdev

My suspicion is that using a blend of synthetic and real data may work better, because you have some ground-truth examples (real data). I wonder how useful an ML model trained entirely on synthetic data would really be, because such a model usually won't learn anything about real-world behavior.

But I suppose it depends on the use case.

May 13 '24 13:05 srinify

I'm closing this issue out since I replied to the question and there isn't a clear bug or feature request here! If something else comes to mind, please don't hesitate to open a new issue @danielemolino

May 21 '24 14:05 srinify