Redundant whitespace in the demo data

Open AndresAlgaba opened this issue 3 years ago • 4 comments

Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.

Environment Details

  • CTGAN version: latest (0.5.2.dev1)
  • Python version: 3.9.7
  • Operating System: Windows

Error Description

When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:

samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

I get the following warning: rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.

After looking into it, I found that the category values in the discrete columns start with a redundant leading space. Using ' United-States' (with the leading space) works fine:

samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
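
A quick way to check for this kind of padding is to compare each string value with its stripped form. The snippet below is only a minimal sketch: the DataFrame is a toy stand-in for the demo data, and columns_with_padded_values is a hypothetical helper, not part of CTGAN.

```python
import pandas as pd

# Toy stand-in for the demo data (assumption); the leading spaces
# mimic the problem described above.
data = pd.DataFrame({
    'age': [39, 50, 38],
    'native-country': [' United-States', ' Cuba', ' United-States'],
})


def columns_with_padded_values(df):
    """Return the names of columns holding strings with surrounding whitespace."""
    padded = []
    for name in df.columns:
        # Only string cells can carry whitespace; skip numbers and NaN.
        strings = [value for value in df[name] if isinstance(value, str)]
        if any(value != value.strip() for value in strings):
            padded.append(name)
    return padded


print(columns_with_padded_values(data))       # ['native-country']
print(repr(data['native-country'].iloc[0]))   # ' United-States'
```

Printing with repr makes the leading space visible, which is easy to miss in a normal DataFrame display.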

Solution

I propose setting skipinitialspace=True in the pd.read_csv call inside the load_demo function:

def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)

This seems to solve the issue.
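
To illustrate the effect of skipinitialspace in isolation, here is a small sketch that parses an in-memory CSV instead of the real demo file. The CSV text is made up for illustration and only assumed to resemble the demo file's layout (a space after each delimiter in the data rows).

```python
import io

import pandas as pd

# Made-up sample (assumption): data rows use ", " between fields.
raw = (
    "age,native-country,income\n"
    "39, United-States, <=50K\n"
    "50, Cuba, >50K\n"
)

# Default parsing keeps the space after each delimiter as part of the value.
default = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows a delimiter.
cleaned = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(repr(default['native-country'].iloc[0]))  # ' United-States'
print(repr(cleaned['native-country'].iloc[0]))  # 'United-States'
```

Numeric columns such as age are parsed as integers in both cases; only the string values differ.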

Steps to reproduce

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)

# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

AndresAlgaba, Jul 19 '22 09:07

Hi @AndresAlgaba, nice to meet you and thanks for bringing this to our attention.

The root cause is probably the way the original data was exported. While your suggestion would solve the issue for this particular demo, we'd prefer to fix the format of the underlying data itself. We can then apply the same principle to any future demo datasets that deviate from a true CSV format in other ways.

I suggest we repurpose this bug for reformatting the original demo data as a proper CSV file. For now, we suggest that everyone use your manual workaround of reading the CSV with skipinitialspace=True.
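
For anyone who already has the data loaded as a DataFrame and cannot easily change the read_csv call, stripping the values afterwards might serve as an alternative workaround. This is only a sketch under that assumption: strip_string_values is a hypothetical helper (not part of CTGAN or SDV), and unlike skipinitialspace it also removes trailing whitespace.

```python
import pandas as pd


def strip_string_values(df):
    """Return a copy of df with surrounding whitespace removed from string cells."""
    cleaned = df.copy()
    for name in cleaned.columns:
        column = cleaned[name]
        # Leave columns without any string cells (e.g. numeric ones) untouched.
        if not column.map(lambda value: isinstance(value, str)).any():
            continue
        cleaned[name] = column.map(
            lambda value: value.strip() if isinstance(value, str) else value
        )
    return cleaned


# Toy stand-in for the loaded demo data (assumption).
demo_like = pd.DataFrame({
    'age': [39, 50],
    'native-country': [' United-States', ' Cuba'],
})

stripped = strip_string_values(demo_like)
print(list(stripped['native-country']))  # ['United-States', 'Cuba']
```

With the demo, this would be applied as data = strip_string_values(load_demo()) before calling fit, so that condition_value='United-States' matches a category seen during training.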

npatki, Jul 19 '22 15:07

Hi @npatki, nice to meet you too! It's my pleasure; thanks to the team for the effort on SDV and the quick response.

Yes, I agree. Is there anything I can help with? I have already opened a PR.

AndresAlgaba, Jul 20 '22 11:07

Unfortunately we don't have public write access to the S3 bucket, which is needed to make this change. We'll add this to our backlog and update the bug when we have a fix.

Thanks for your offer to help!

npatki, Jul 21 '22 14:07

Okay, thanks, and no problem!

AndresAlgaba, Jul 24 '22 19:07