Redundant whitespace in the demo data
Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set, and I'd like to propose a quick fix, for which I will submit a PR.
Environment Details
- CTGAN version: latest (0.5.2.dev1)
- Python version: 3.9.7
- Operating System: Windows
Error Description
When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')
I get the following warning:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.
After looking into it, I found that the category values of the discrete columns start with a redundant space. Using ' United-States' (with the leading space) works fine:
samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
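One way to confirm this kind of padding is to scan the string columns for values that change when stripped. The snippet below is a minimal sketch: the small DataFrame is a hypothetical stand-in for the demo data, and columns_with_padding is an illustrative helper, not part of CTGAN or pandas.

```python
import pandas as pd

# Hypothetical stand-in for the demo data: the category values carry a
# leading space, as described above.
data = pd.DataFrame({
    'native-country': [' United-States', ' Cuba', ' United-States'],
    'age': [39, 50, 38],
})


def columns_with_padding(df):
    """Return the names of columns in which any string value has surrounding whitespace."""
    padded = []
    for name in df.columns:
        strings = [v for v in df[name].dropna() if isinstance(v, str)]
        if any(v != v.strip() for v in strings):
            padded.append(name)
    return padded


print(columns_with_padding(data))
# repr() makes the leading space visible in the output.
print(repr(data['native-country'].unique()[0]))
```

Running the same helper on the frame returned by load_demo() should list the affected discrete columns, assuming they are loaded as plain Python strings.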
Solution
I propose setting the skipinitialspace argument of pd.read_csv to True in the load_demo function:
def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)
This seems to solve the issue.
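To illustrate what skipinitialspace changes, here is a small self-contained comparison. The CSV text is a made-up sample that assumes the demo file separates fields with a comma followed by a space; it is not the actual demo data.

```python
import io

import pandas as pd

# Hypothetical sample: the header is clean, but each data field after the
# first is preceded by a space.
raw = (
    "age,native-country,income\n"
    "39, United-States, <=50K\n"
    "50, Cuba, >50K\n"
)

# Default parsing keeps the space as part of each string value.
default = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows the delimiter.
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(repr(default['native-country'].iloc[0]))
print(repr(trimmed['native-country'].iloc[0]))
```

With the default settings the first value keeps its leading space, so a condition_value of 'United-States' would not match any category seen during fitting.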
Steps to reproduce
from ctgan import CTGANSynthesizer
from ctgan import load_demo
data = load_demo()
# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]
ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)
# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')
Hi @AndresAlgaba, nice to meet you and thanks for bringing this to our attention.
The root cause was probably the way the original data was exported. While your suggestion would solve the issue for this particular demo, we'd prefer to fix the format of the underlying data itself. We can apply the same principle to any future demo datasets that deviate from a proper CSV format in other ways.
I suggest we repurpose this bug to track reformatting the original demo data as a proper CSV file. For now, we recommend that everyone use your manual workaround of reading the CSV with skipinitialspace=True.
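For data that has already been loaded, for example through load_demo(), an alternative to re-reading the file is to strip the string values in place before fitting. This is only a sketch of that idea: strip_string_values is a hypothetical helper, and the DataFrame is a stand-in for the demo data.

```python
import pandas as pd


def strip_string_values(df):
    """Return a copy of df with surrounding whitespace removed from string cells."""
    cleaned = df.copy()
    for name in cleaned.columns:
        # Non-string cells (numbers, NaN) are passed through unchanged.
        cleaned[name] = cleaned[name].map(
            lambda v: v.strip() if isinstance(v, str) else v
        )
    return cleaned


# Hypothetical stand-in for the frame returned by load_demo().
data = pd.DataFrame({
    'native-country': [' United-States', ' Cuba'],
    'age': [39, 50],
})

cleaned = strip_string_values(data)
print(cleaned['native-country'].tolist())
```

Fitting on the cleaned frame should then let condition_value='United-States' match a known category, assuming leading whitespace is the only formatting problem in the data.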
Hi @npatki, nice to meet you too! It's my pleasure; thanks to the team for the effort on SDV and the quick response.
Yes, I agree. Is there anything I can help with? I have already opened a PR.
Unfortunately we don't have public write access to the S3 bucket, which is needed to make this change. We'll add this to our backlog and update the bug when we have a fix.
Thanks for your offer to help!
Okay, thanks, and no problem!