CTGAN
TVAE: Code not properly capturing correlation between categorical and numerical columns
If we run the code below, the sampled data keeps the relationship between the two categorical columns ~99% of the time:
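import pandas as pd
from ctgan import TVAESynthesizer  # import path assumed for the CTGAN version in use

data = pd.DataFrame({
    '1': ['1.0', '2.0', '3.0'] * 150,
    '2': ['a', 'b', 'c'] * 150
})
tvae = TVAESynthesizer(epochs=300)
# Both columns are declared as discrete.
tvae.fit(data, ['1', '2'])
sampled = tvae.sample(1000)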
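One way to put a number on "keeps the relationship" is to count how many sampled rows contain a pair that exists in the training data. The sketch below is only an illustration: match_rate and EXPECTED are hypothetical names, not part of CTGAN, and the helper assumes the columns are called '1' and '2' as above. It accepts anything that can be indexed by column name and iterated, so a DataFrame such as sampled should work as well as the plain dict used here.

```python
# Hypothetical helper (not part of CTGAN): fraction of rows whose
# (column '1', column '2') pair also appears in the training data.
EXPECTED = {'1.0': 'a', '2.0': 'b', '3.0': 'c'}


def match_rate(sampled, expected=EXPECTED):
    # Pair up the two columns row by row.
    pairs = list(zip(sampled['1'], sampled['2']))
    if not pairs:
        return 0.0
    hits = sum(1 for left, right in pairs if expected.get(left) == right)
    return hits / len(pairs)


# Tiny hand-made example; with the model above, pass `sampled` instead.
toy = {'1': ['1.0', '2.0', '3.0', '1.0'], '2': ['a', 'b', 'c', 'b']}
print(match_rate(toy))  # 0.75
```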
However, if we change column 1 to be numerical, the model is unable to understand the relationship between the columns, and the matches are pretty much random:
data = pd.DataFrame({
    '1': [1.0, 2.0, 3.0] * 150,
    '2': ['a', 'b', 'c'] * 150
})
tvae = TVAESynthesizer(epochs=300)
# Only column '2' is declared as discrete; column '1' is treated as continuous.
tvae.fit(data, ['2'])
sampled = tvae.sample(1000)
print(sampled.head(20))
This returns:
           1  2
0   3.008063  c
1   0.998053  a
2   2.989118  b
3   2.999388  b
4   2.989977  b
5   3.001034  a
6   3.002145  c
7   2.991087  a
8   1.998736  a
9   2.993332  a
10  2.987284  b
11  2.997205  c
12  3.003766  b
13  2.992854  b
14  1.995608  a
15  2.990174  b
16  2.991794  c
17  2.988642  b
18  2.988973  a
19  1.011714  b
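To quantify the mismatch in the output above, the numerical column can be rounded to the nearest integer and compared against the training pairs (1 with a, 2 with b, 3 with c). This is a rough sketch: numeric_match_rate is a hypothetical helper, not part of CTGAN, and the rows are copied by hand from the 20 printed above.

```python
# The 20 rows printed above, as (column '1', column '2') pairs.
rows = [
    (3.008063, 'c'), (0.998053, 'a'), (2.989118, 'b'), (2.999388, 'b'),
    (2.989977, 'b'), (3.001034, 'a'), (3.002145, 'c'), (2.991087, 'a'),
    (1.998736, 'a'), (2.993332, 'a'), (2.987284, 'b'), (2.997205, 'c'),
    (3.003766, 'b'), (2.992854, 'b'), (1.995608, 'a'), (2.990174, 'b'),
    (2.991794, 'c'), (2.988642, 'b'), (2.988973, 'a'), (1.011714, 'b'),
]

# Pairs present in the training data.
EXPECTED = {1: 'a', 2: 'b', 3: 'c'}


def numeric_match_rate(pairs, expected=EXPECTED):
    # Round each sampled value to the nearest training value before comparing.
    hits = sum(1 for value, label in pairs if expected.get(round(value)) == label)
    return hits / len(pairs)


print(numeric_match_rate(rows))  # 0.25
```

Only 5 of these 20 rows match, which is in the neighborhood of the one-in-three rate that independent columns would give.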