CTGAN icon indicating copy to clipboard operation
CTGAN copied to clipboard

Improve correlation for low cardinality, discrete data

Open Kaabachi opened this issue 4 years ago • 2 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 0.4.3
  • Python version: 3.8.5
  • Operating System: Linux

Error Description

When I generate data with CTGAN on numerical data, the correlation is not maintained between the different columns ( see figure ). It is possible to reproduce my experiment below:

corr

Steps to reproduce

Unfortunately I have this problem with most of the datasets I'm using even when I'm trying to recreate some of the work that has been done in recent papers. I provide here a small example of a dummy dataset.

# create dummy data, where column "1" and "2" are clearly correlated while the "3" is just random
dummy_data = pd.DataFrame({
    '1': [1,2] * 100,
    '2': [4 ,5] * 100,
    "3": [random.randint(0,9) for x in range(200)],
})


# fit CTGAN on said data
ctgan = CTGAN()
ctgan.fit(dummy_data)

# generate samples
samples = ctgan.sample(1000)

Kaabachi avatar Oct 21 '21 12:10 Kaabachi

Hi @Kaabachi, thanks for filing. I am seeing some improved results with the newest version of CTGAN, particularly if I increase the number of epochs. Below is an example of running with 2500 epochs.

image

Another challenge might be the fact that there are only 2 unique values in the dummy data, columns '1' and '2'. Depending on the meaning of the data, it might make more sense to model these as categorical variables, not numerical. Which recent work are you replicating? Does it frequently feature columns with a low number of distinct values?

npatki avatar Jul 13 '22 19:07 npatki

From trying out CTGAN on a variety of datasets, we are seeing that it is generally converging to better and better quality correlations, especially for continuous data. Since your example is for low cardinality, discrete data, I'll make this issue more specific to that. I suspect that this may work better if (a) it is modeled as categorical or (b) if we pre-process such data to add noise.

npatki avatar Aug 08 '22 19:08 npatki

Hello, I'm closing this issue off since we've had some discussion and we're seeing some improvements.

The issue is a bit old too -- we just released a new, SDV 1.0 library that will also take advantage of the latest techniques in data preprocessing. Check it out and let's create a new issue if this problem persists!

npatki avatar Mar 29 '23 17:03 npatki