
TVAE: Code not properly capturing correlation between categorical and numerical columns

Open fealho opened this issue 4 years ago • 0 comments

If we run the code below, sampled keeps the relationship between the two categorical columns (1.0 with a, 2.0 with b, 3.0 with c) ~99% of the time:

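import pandas as pd
# Assumed import path: TVAESynthesizer was the class name exported by ctgan at the time of this issue.
from ctgan import TVAESynthesizer

data = pd.DataFrame({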
    '1': ['1.0', '2.0', '3.0'] * 150,
    '2': ['a', 'b', 'c'] * 150
})

tvae = TVAESynthesizer(epochs=300)
tvae.fit(data, ['1', '2'])

sampled = tvae.sample(1000)
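
One way to put a number on the "~99%" figure is to count how often a sampled row carries the same pairing as the training data. The following is only a minimal sketch: `match_rate` and `EXPECTED` are hypothetical names, not part of ctgan, and the small `demo` frame stands in for the real `sampled` output.

```python
import pandas as pd

# Pairing present in every training row: '1.0' <-> 'a', '2.0' <-> 'b', '3.0' <-> 'c'.
EXPECTED = {'1.0': 'a', '2.0': 'b', '3.0': 'c'}


def match_rate(sampled, mapping=EXPECTED):
    """Return the fraction of rows whose column '2' is the value expected for column '1'."""
    expected = sampled['1'].map(mapping)  # unknown values map to NaN and count as mismatches
    return float((expected == sampled['2']).mean())


# Stand-in for `sampled`; with a fitted model, call match_rate(sampled) instead.
demo = pd.DataFrame({'1': ['1.0', '2.0', '3.0', '1.0'], '2': ['a', 'b', 'c', 'b']})
print(match_rate(demo))  # 0.75 (three of the four demo rows match)
```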

However, if we make column 1 numerical (and no longer pass it as a discrete column), the model fails to capture the relationship between the columns, and the pairings are close to random:

data = pd.DataFrame({
    '1': [1.0, 2.0, 3.0] * 150,
    '2': ['a', 'b', 'c'] * 150
})

tvae = TVAESynthesizer(epochs=300)
tvae.fit(data, ['2'])

sampled = tvae.sample(1000)
print(sampled.head(20))

This returns:

           1  2
0   3.008063  c
1   0.998053  a
2   2.989118  b
3   2.999388  b
4   2.989977  b
5   3.001034  a
6   3.002145  c
7   2.991087  a
8   1.998736  a
9   2.993332  a
10  2.987284  b
11  2.997205  c
12  3.003766  b
13  2.992854  b
14  1.995608  a
15  2.990174  b
16  2.991794  c
17  2.988642  b
18  2.988973  a
19  1.011714  b

fealho, Mar 12 '21 00:03