
Questions about reconstruction and adversarial losses

Open lukau2357 opened this issue 1 year ago • 3 comments

lukau2357 avatar Jan 30 '25 17:01 lukau2357

As far as I know, AutoencoderKL is more of an AE (Autoencoder) than a VAE (Variational Autoencoder). Therefore, in the original code, the difference between L1 and L2 is not very significant. Additionally, StabilityAI later developed a new version of AutoencoderKL that uses MSE as the loss function.

Can you elaborate on the doubts regarding the generator loss?

lavinal712 avatar Mar 11 '25 01:03 lavinal712

I agree that L1 is sensible for classical autoencoders, but I'm fairly sure the AutoencoderKL class actually implements a VAE: the encoder maps an image to a diagonal Gaussian, and the forward pass samples from it via the reparameterization trick before decoding (see the links in my initial post: mapping to a Gaussian, calling the reparameterization trick, and the reparameterization code itself). So my original question about AutoencoderKL still stands: when the prior $p(z)$, the approximate posterior $q_{\theta}(z \mid x)$ and the conditional data likelihood $p_{W}(x \mid z)$ are all Gaussian, the reconstruction term should be L2 (MSE), not L1, alongside the KL divergence component.
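
For reference, the reparameterization step in question can be sketched as follows. This is a minimal paraphrase of the diagonal-Gaussian posterior logic, not the repository's exact code; the class and method names here are illustrative:

```python
import torch

class DiagonalGaussian:
    """Encoder output interpreted as a diagonal Gaussian q(z|x)."""

    def __init__(self, moments: torch.Tensor):
        # The encoder emits mean and log-variance stacked along channels.
        self.mean, self.logvar = torch.chunk(moments, 2, dim=1)
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self) -> torch.Tensor:
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        return self.mean + self.std * torch.randn_like(self.mean)

    def kl(self) -> torch.Tensor:
        # KL(q(z|x) || N(0, I)) per example, summed over latent dims.
        return 0.5 * torch.sum(
            self.mean ** 2 + torch.exp(self.logvar) - 1.0 - self.logvar,
            dim=[1, 2, 3],
        )
```

Because the sample is a differentiable function of `mean` and `std`, gradients flow through the sampling step, which is exactly the VAE construction.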

Classical/vanilla GAN losses are the following, where `real_pred` and `fake_pred` are the discriminator's outputs (logits) for real and fake images. Discriminator: `(F.softplus(-real_pred) + F.softplus(fake_pred)).mean()`. Generator: `F.softplus(-fake_pred).mean()`, the non-saturating variant pointed out by Goodfellow et al. in the original GAN paper.
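
Written out as a runnable sketch (assuming raw discriminator logits, as in the softplus formulation above; function names are mine):

```python
import torch
import torch.nn.functional as F

def d_loss_vanilla(real_pred: torch.Tensor, fake_pred: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(real) - log(1 - sigmoid(fake)), written in logit form.
    return (F.softplus(-real_pred) + F.softplus(fake_pred)).mean()

def g_loss_nonsaturating(fake_pred: torch.Tensor) -> torch.Tensor:
    # Non-saturating generator loss: -log sigmoid(fake).
    return F.softplus(-fake_pred).mean()
```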

Without going into theoretical details, the Wasserstein GAN (WGAN) losses are: discriminator (critic): `(fake_pred - real_pred).mean()`; generator: `(-fake_pred).mean()`.
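
In the same notation (here the outputs are unbounded critic scores, with no sigmoid):

```python
import torch

def d_loss_wgan(real_pred: torch.Tensor, fake_pred: torch.Tensor) -> torch.Tensor:
    # The critic maximizes real - fake scores, so the loss is fake - real.
    return (fake_pred - real_pred).mean()

def g_loss_wgan(fake_pred: torch.Tensor) -> torch.Tensor:
    # The generator maximizes the critic's score on fakes.
    return (-fake_pred).mean()
```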

My question is: why would they use the WGAN loss for the generator but the classical GAN loss for the discriminator (for code references, see the links in my initial post)? Mathematically, these two GAN frameworks are very different.

lukau2357 avatar Mar 11 '25 21:03 lukau2357

The difference between AutoencoderKL and a typical VAE is that you cannot actually sample from the Gaussian prior and feed the sample to the decoder to generate a meaningful image. AutoencoderKL relies on a diffusion model to generate images; in this sense, it is an AE rather than a VAE. If you read the paper, you can see that the coefficient of the KL regularization term is 1e-6. From the training loss, it can also be observed that the KL term accounts for a very small proportion of the total loss.
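
The proportions are easy to sanity-check. With `kl_weight = 1e-6` (the coefficient reported above), even a raw KL term several orders of magnitude larger than the reconstruction term contributes very little; the magnitudes below are made up for illustration, not taken from an actual training run:

```python
kl_weight = 1e-6   # coefficient of the KL regularization term
rec_loss = 0.05    # hypothetical reconstruction term
kl_loss = 1.0e3    # hypothetical raw KL term (summed over latent dims)

total = rec_loss + kl_weight * kl_loss
kl_share = (kl_weight * kl_loss) / total  # roughly 2% of the total loss
```

So the KL term acts as a weak regularizer that keeps the latent space roughly centered and scaled, rather than enforcing a usable prior for ancestral sampling.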

I still don't see why the mismatch between the generator and discriminator losses is a significant issue. In my understanding, the two losses are decoupled; the generator loss just needs to be correlated with the discriminator's judgment, i.e. decrease as the discriminator rates fakes as more real.
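
One way to make the "decoupled, just correlated" point concrete: both generator losses are monotonically decreasing in the discriminator's logit on fakes, so their gradients always point the same way and differ only in scale/saturation. A quick illustrative check (not code from the repository):

```python
import torch
import torch.nn.functional as F

fake_pred = torch.linspace(-3.0, 3.0, steps=7, requires_grad=True)

g_ns = F.softplus(-fake_pred).sum()   # non-saturating GAN generator loss
g_wgan = (-fake_pred).sum()           # WGAN-style generator loss

grad_ns, = torch.autograd.grad(g_ns, fake_pred)
grad_wgan, = torch.autograd.grad(g_wgan, fake_pred)

# Both gradients are strictly negative everywhere: raising the
# discriminator's logit on fakes lowers either loss.
assert (grad_ns < 0).all() and (grad_wgan < 0).all()
```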

lavinal712 avatar Mar 12 '25 02:03 lavinal712