normalize discriminator's output
The discriminator outputs consumed by discriminator_loss and generator_loss are supposed to lie in [0, 1], but looking at the code, the raw output x can take any value.
I have fixed the code. Since the output is also used in other calculations, I could not fully verify the behavior myself, so please check it.
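For context, the idea is roughly the following. This is a minimal sketch, not the actual diff: the module name and layer structure are illustrative stand-ins, and only the final sigmoid is the point.

import torch
import torch.nn as nn


class TinySubDiscriminator(nn.Module):
    # Illustrative stand-in for a VITS-style sub-discriminator (not the real RVC code).
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 16, 5, padding=2),
            nn.Conv1d(16, 32, 5, padding=2),
        ])
        self.conv_post = nn.Conv1d(32, 1, 3, padding=1)

    def forward(self, x):
        fmap = []
        for conv in self.convs:
            x = nn.functional.leaky_relu(conv(x), 0.1)
            fmap.append(x)
        x = self.conv_post(x)
        fmap.append(x)
        x = torch.flatten(x, 1, -1)
        x = torch.sigmoid(x)  # proposed fix: squash the raw output into [0, 1]
        return x, fmap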
The discriminator code is based on https://github.com/jaywalnut310/vits/blob/main/models.py. I don't know whether the original code hurts performance, but values outside 0 to 1 don't seem to cause any errors. Are there any experimental results comparing performance with and without the added sigmoid layer?
This was a bug I found while rewriting the discriminator. I was able to get VC working with this modification (together with many other changes) in my own training runs, but I haven't been able to make a fair comparison yet.
import torch


def discriminator_loss(disc_real_outputs, disc_generated_outputs):
    # Least-squares discriminator loss: push outputs for real audio toward 1
    # and outputs for synthesized audio toward 0, summed over all sub-discriminators.
    loss = 0
    r_losses = []
    g_losses = []
    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
        dr = dr.float()
        dg = dg.float()
        r_loss = torch.mean((1 - dr) ** 2)  # penalty for real outputs far from 1
        g_loss = torch.mean(dg ** 2)        # penalty for generated outputs far from 0
        loss += r_loss + g_loss
        r_losses.append(r_loss.item())
        g_losses.append(g_loss.item())
    return loss, r_losses, g_losses
def generator_loss(disc_outputs):
    # Least-squares generator loss: push the discriminator's outputs for
    # synthesized audio toward 1, summed over all sub-discriminators.
    loss = 0
    gen_losses = []
    for dg in disc_outputs:
        dg = dg.float()
        l = torch.mean((1 - dg) ** 2)  # penalty for generated outputs far from 1
        gen_losses.append(l)
        loss += l
    return loss, gen_losses
Look at the code above. This is part of the discriminator and generator loss. In discriminator_loss, the loss is smallest when the value produced from the real audio is close to 1 and the value produced from the synthesized audio is close to 0. In generator_loss, the loss is minimized when the value produced from the synthesized audio is close to 1.
From this, normalizing the discriminator's outputs to the range [0, 1] with a sigmoid function should, in theory, lead to better performance.
However, training currently works and this is not a fatal bug, so I would appreciate it if someone from the development team could run a comparative experiment.
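As a quick sanity check of that behaviour, using the two functions above with dummy tensors of an arbitrary shape:

real = [torch.ones(2, 10)]    # discriminator outputs 1 for real audio
fake = [torch.zeros(2, 10)]   # discriminator outputs 0 for synthesized audio

d_loss, _, _ = discriminator_loss(real, fake)  # tensor(0.) -- the minimum
g_loss, _ = generator_loss(fake)               # tensor(1.) -- shrinks as fake outputs approach 1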
To test this change, I patched the modification into https://github.com/ddPn08/rvc-webui and ran two 200-epoch training runs from scratch on 30 minutes of data, with no pretrained model.
This PR's code: (TensorBoard training curves)
The vanilla model's code: (TensorBoard training curves)
I put the TensorBoard log files in the RVC Discord for manual review if you want to look at the full graphs.
Unfortunately, this code changes the discriminator's state-dict keys, so the base models would need to be retrained from scratch if it's merged.
After some changes by Nadare to fix loading pretrained models, we ran an additional test with a pretrained model; based on the results of that training run, this change negatively impacts model quality.
Same training conditions and data as the previous test, but warm-started from a checkpoint.
Vanilla Model: (TensorBoard training curves)
Normalized Model: (TensorBoard training curves)
Section 2.3 of the VITS paper states that the sigmoid is not used and the least-squares error is used instead. This is the method proposed in LSGAN, and it was introduced to address the problem of the error (and its gradient) becoming too small when using the cross-entropy loss. Sorry for the confusion. This PR will be closed.
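For anyone else who hits this, here is a toy example (not code from this repository; the value is chosen purely for illustration) of how the sigmoid cross-entropy formulation can saturate where the least-squares loss still gives a useful gradient:

import torch

# Raw discriminator output for a generated sample that the discriminator
# confidently rejects (hypothetical value, far on the negative side).
x = torch.tensor([-8.0], requires_grad=True)

# Original minimax generator term log(1 - sigmoid(x)): the gradient vanishes.
minimax = torch.log(1 - torch.sigmoid(x))
minimax.backward()
print(x.grad)   # ~ -3.4e-4: almost no learning signal

# LSGAN generator term (1 - x)^2 on the raw output: gradient stays large
# and scales with the distance from the target value 1.
x.grad = None
lsgan = (1 - x) ** 2
lsgan.backward()
print(x.grad)   # -18.0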