
Questions about reconstruction and adversarial losses

Open lukau2357 opened this issue 1 year ago • 3 comments

lukau2357 avatar Jan 30 '25 17:01 lukau2357

As far as I know, AutoencoderKL is more of an AE (Autoencoder) than a VAE (Variational Autoencoder). Therefore, in the original code, the difference between L1 and L2 is not very significant. Additionally, StabilityAI later developed a new version of AutoencoderKL that uses MSE as the loss function.

Can you elaborate on the doubts regarding the generator loss?

lavinal712 avatar Mar 11 '25 01:03 lavinal712

I agree that L1 is sensible for classical autoencoders, but I'm fairly sure the AutoencoderKL class actually implements a VAE: the encoder maps an image to a diagonal Gaussian, and the forward pass samples from it via the reparameterization trick before decoding (see the links in my initial post: mapping to a Gaussian, calling the reparameterization trick, and the reparameterization code itself). So my original question about AutoencoderKL still stands: when the prior $p(z)$, the approximate posterior $q_{\theta}(z \mid x)$ and the conditional data likelihood $p_{W}(x \mid z)$ are all Gaussian, the reconstruction term should be L2 (MSE), not L1, alongside the KL divergence component.
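
For reference, the reparameterization step in question can be sketched as follows. This is a minimal paraphrase of the diagonal-Gaussian posterior logic, not the repository's exact code; the class and method names here are illustrative:

```python
import torch

class DiagonalGaussian:
    """Encoder output interpreted as a diagonal Gaussian q(z|x)."""

    def __init__(self, moments: torch.Tensor):
        # The encoder emits mean and log-variance stacked along channels.
        self.mean, self.logvar = torch.chunk(moments, 2, dim=1)
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self) -> torch.Tensor:
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        return self.mean + self.std * torch.randn_like(self.mean)

    def kl(self) -> torch.Tensor:
        # KL(q(z|x) || N(0, I)) per example, summed over latent dims.
        return 0.5 * torch.sum(
            self.mean ** 2 + torch.exp(self.logvar) - 1.0 - self.logvar,
            dim=[1, 2, 3],
        )
```

Because the sample is a differentiable function of `mean` and `std`, gradients flow through the sampling step, which is exactly the VAE construction.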

Classical/vanilla GAN losses are the following, where `real_pred` and `fake_pred` are the discriminator's outputs (logits) for real and fake images. Discriminator: `(F.softplus(-real_pred) + F.softplus(fake_pred)).mean()`. Generator: `F.softplus(-fake_pred).mean()`, the non-saturating variant pointed out by Goodfellow et al. in the original GAN paper.
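
Written out as a runnable sketch (assuming raw discriminator logits, as in the softplus formulation above; function names are mine):

```python
import torch
import torch.nn.functional as F

def d_loss_vanilla(real_pred: torch.Tensor, fake_pred: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(real) - log(1 - sigmoid(fake)), written in logit form.
    return (F.softplus(-real_pred) + F.softplus(fake_pred)).mean()

def g_loss_nonsaturating(fake_pred: torch.Tensor) -> torch.Tensor:
    # Non-saturating generator loss: -log sigmoid(fake).
    return F.softplus(-fake_pred).mean()
```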

Without going into theoretical details, the Wasserstein GAN (WGAN) losses are: discriminator (critic): `(fake_pred - real_pred).mean()`; generator: `(-fake_pred).mean()`.
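
In the same notation (here the outputs are unbounded critic scores, with no sigmoid):

```python
import torch

def d_loss_wgan(real_pred: torch.Tensor, fake_pred: torch.Tensor) -> torch.Tensor:
    # The critic maximizes real - fake scores, so the loss is fake - real.
    return (fake_pred - real_pred).mean()

def g_loss_wgan(fake_pred: torch.Tensor) -> torch.Tensor:
    # The generator maximizes the critic's score on fakes.
    return (-fake_pred).mean()
```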

My question is: why would they use the WGAN loss for the generator but the classical GAN loss for the discriminator (for code references, see the links in my initial post)? Mathematically, these two GAN frameworks are very different.

lukau2357 avatar Mar 11 '25 21:03 lukau2357

The difference between AutoencoderKL and a typical VAE is that you cannot actually sample from the Gaussian prior and feed the sample to the decoder to generate a meaningful image. AutoencoderKL relies on a diffusion model to generate images; in this sense, it is an AE rather than a VAE. If you read the paper, you can see that the coefficient of the KL regularization term is 1e-6. From the training loss, it can also be observed that the KL term accounts for a very small proportion of the total loss.
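
The proportions are easy to sanity-check. With `kl_weight = 1e-6` (the coefficient reported above), even a raw KL term several orders of magnitude larger than the reconstruction term contributes very little; the magnitudes below are made up for illustration, not taken from an actual training run:

```python
kl_weight = 1e-6   # coefficient of the KL regularization term
rec_loss = 0.05    # hypothetical reconstruction term
kl_loss = 1.0e3    # hypothetical raw KL term (summed over latent dims)

total = rec_loss + kl_weight * kl_loss
kl_share = (kl_weight * kl_loss) / total  # roughly 2% of the total loss
```

So the KL term acts as a weak regularizer that keeps the latent space roughly centered and scaled, rather than enforcing a usable prior for ancestral sampling.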

I still don't see why the mismatch between the generator and discriminator losses is a significant issue. In my understanding, the two losses are decoupled; the generator loss just needs to be correlated with the discriminator's judgment, i.e. decrease as the discriminator rates fakes as more real.
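
One way to make the "decoupled, just correlated" point concrete: both generator losses are monotonically decreasing in the discriminator's logit on fakes, so their gradients always point the same way and differ only in scale/saturation. A quick illustrative check (not code from the repository):

```python
import torch
import torch.nn.functional as F

fake_pred = torch.linspace(-3.0, 3.0, steps=7, requires_grad=True)

g_ns = F.softplus(-fake_pred).sum()   # non-saturating GAN generator loss
g_wgan = (-fake_pred).sum()           # WGAN-style generator loss

grad_ns, = torch.autograd.grad(g_ns, fake_pred)
grad_wgan, = torch.autograd.grad(g_wgan, fake_pred)

# Both gradients are strictly negative everywhere: raising the
# discriminator's logit on fakes lowers either loss.
assert (grad_ns < 0).all() and (grad_wgan < 0).all()
```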

lavinal712 avatar Mar 12 '25 02:03 lavinal712