CTGAN Improve correlation for low cardinality, discrete data

Environment Details

Please indicate the following details about the environment in which you found the bug:

CTGAN version: 0.4.3
Python version: 3.8.5
Operating System: Linux

Error Description

When I generate data with CTGAN on numerical data, the correlation is not maintained between the different columns ( see figure ). It is possible to reproduce my experiment below:

corr

Steps to reproduce

Unfortunately I have this problem with most of the datasets I'm using even when I'm trying to recreate some of the work that has been done in recent papers. I provide here a small example of a dummy dataset.

# create dummy data, where column "1" and "2" are clearly correlated while the "3" is just random
dummy_data = pd.DataFrame({
    '1': [1,2] * 100,
    '2': [4 ,5] * 100,
    "3": [random.randint(0,9) for x in range(200)],
})


# fit CTGAN on said data
ctgan = CTGAN()
ctgan.fit(dummy_data)

# generate samples
samples = ctgan.sample(1000)

Oct 21 '21 12:10 Kaabachi

Hi @Kaabachi, thanks for filing. I am seeing some improved results with the newest version of CTGAN, particularly if I increase the number of epochs. Below is an example of running with 2500 epochs.

Another challenge might be the fact that there are only 2 unique values in the dummy data, columns '1' and '2'. Depending on the meaning of the data, it might make more sense to model these as categorical variables, not numerical. Which recent work are you replicating? Does it frequently feature columns with a low number of distinct values?

Jul 13 '22 19:07 npatki

From trying out CTGAN on a variety of datasets, we are seeing that it is generally converging to better and better quality correlations, especially for continuous data. Since your example is for low cardinality, discrete data, I'll make this issue more specific to that. I suspect that this may work better if (a) it is modeled as categorical or (b) if we pre-process such data to add noise.

Aug 08 '22 19:08 npatki

Hello, I'm closing this issue off since we've had some discussion and we're seeing some improvements.

The issue is a bit old too -- we just released a new, SDV 1.0 library that will also take advantage of the latest techniques in data preprocessing. Check it out and let's create a new issue if this problem persists!

Mar 29 '23 17:03 npatki