
Does it make sense to try image loss in stage 2?

Open AlexzQQQ opened this issue 1 year ago • 5 comments

I tried using gumbel-softmax to turn the latent predictions into images so I could apply several image-space losses (perceptual loss, L1, adversarial; the aim was to handle tasks that cross-entropy loss might not suit well) in stage 2 (the transformer training phase), but none of them seemed to work. I wonder if my approach was wrong. Thanks for your excellent work!!
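For concreteness, here's a minimal NumPy sketch of the mixing being discussed (toy sizes; names like `codebook` are my illustration, not from the VAR codebase): `hard=False` yields a soft probability vector whose matmul with the codebook blends several code embeddings, while `hard=True` commits to exactly one code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, hard=False):
    """Gumbel-softmax sample. hard=False: soft probabilities (mixes codes);
    hard=True: one-hot (a single code, as standard VQ decoding expects)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)          # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return soft
    return np.eye(soft.shape[-1])[soft.argmax(axis=-1)]

codebook = rng.normal(size=(8, 4))   # toy: 8 codes, 4-dim embeddings
logits = rng.normal(size=(3, 8))     # predicted logits for 3 token positions

soft = gumbel_softmax(logits, tau=0.5, hard=False)
z_mixed = soft @ codebook            # each row blends several code embeddings
one_hot = gumbel_softmax(logits, tau=0.5, hard=True)
z_single = one_hot @ codebook        # each row is exactly one codebook entry
```

(In an autograd framework, `hard=True` is usually paired with a straight-through estimator so gradients flow through the soft sample; this sketch only shows the forward pass.)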

AlexzQQQ avatar Jan 09 '25 03:01 AlexzQQQ

Also, I find that sometimes gumbel-softmax(hard=False) gives a better training result, but is hard=False a bad setting to use? If the codebook is limited, would mixed tokens give better performance? I can't answer this myself and couldn't find useful research on it. Thanks for your excellent work again!!

AlexzQQQ avatar Jan 09 '25 03:01 AlexzQQQ

@AlexzQQQ Have you ever tried mixing only the two highest predicted probabilities within each token to produce each latent embedding? I mean, mask out the non-top-2 probabilities, apply gumbel-softmax to the predicted top-2 probabilities to produce the latent embedding, and decode it with the VQ-VAE. This setup would narrow down your first question: whether the performance drop comes from mixing a massive number of codes. As for your second question, what is your evaluation result when using gumbel-softmax(hard=False)? And what is your temperature setting? gumbel-softmax(hard=False) incorporates multiple token probabilities when producing a latent embedding, and it's hard to tell or analyze whether that's a good move without any evaluation results.
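The top-2 masking suggested above could look like this NumPy sketch (toy shapes, purely illustrative; `top2_gumbel_softmax` is my hypothetical helper name): all but the two largest logits per token are set to negative infinity before the gumbel-softmax, so each embedding mixes at most two codes.

```python
import numpy as np

def top2_gumbel_softmax(logits, tau=1.0, rng=None):
    """Keep only the two largest logits per token, set the rest to -inf,
    then apply gumbel-softmax: each output mixes at most two codes."""
    rng = rng or np.random.default_rng(0)
    top2 = np.argsort(logits, axis=-1)[..., -2:]           # indices of the top-2
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, top2,
                      np.take_along_axis(logits, top2, axis=-1), axis=-1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (masked + g) / tau
    y = y - y.max(axis=-1, keepdims=True)                  # numerical stability
    p = np.exp(y)                                          # exp(-inf) -> 0
    return p / p.sum(axis=-1, keepdims=True)

logits = np.random.default_rng(1).normal(size=(3, 8))
p = top2_gumbel_softmax(logits, tau=0.7)
# every row has exactly two nonzero probabilities summing to 1
```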

Although in theory it's advantageous to mix token probabilities, since each code represents a distinct feature, it's worth exploring whether the pretrained VQ-VAE can fully utilize the rich latent representation produced by the mixed token probabilities, or whether training will simply collapse eventually in the 2nd stage.

jack111331 avatar Jan 16 '25 02:01 jack111331

@jack111331 Thanks for your reply. I will try what you suggested in a few weeks, as I'm busy with work. In my opinion, mixed tokens would be useful if the loss is not only cross-entropy loss.

AlexzQQQ avatar Jan 16 '25 09:01 AlexzQQQ

@AlexzQQQ @jack111331 I have some practical questions. How do you add image loss in stage two? Should the image loss be applied to the output image after performing 'idx = gumbel_softmax(logits)' and 'image = vqvae.idxBl_to_img(idx)'? I displayed the output image and found that the quality is very poor. Did I do something wrong? Thanks very much for the reply!
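One likely pitfall with the `idx = ...; image = ...idxBl_to_img(idx)` route: an integer index lookup is not differentiable, so an image loss computed that way cannot reach the logits. The usual differentiable path is to matmul the soft probabilities with the codebook and feed the resulting embeddings to the frozen decoder. A NumPy toy of the two paths (`W_dec` and `decode` are my stand-ins for the decoder, not VAR's API):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, P = 8, 4, 16                        # toy: codebook size, code dim, "pixels"

codebook = rng.normal(size=(V, D))
W_dec = rng.normal(size=(D, P)) * 0.1     # stand-in for the frozen decoder

def decode(z):                            # hypothetical linear "decoder"
    return z @ W_dec

logits = rng.normal(size=(5, V))          # predictions for 5 token positions

# Differentiable path: soft probabilities mix codebook rows, then decode.
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs = probs / probs.sum(-1, keepdims=True)
img_soft = decode(probs @ codebook)

# Index path: argmax + lookup; the argmax blocks gradients to the logits.
idx = logits.argmax(axis=-1)
img_hard = decode(codebook[idx])
```

Note the two paths also produce different images whenever the probabilities aren't one-hot, which may partly explain the poor visual quality of the decoded soft mixture.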

Longhzzz avatar Mar 20 '25 09:03 Longhzzz

@Longhzzz I use gumbel_softmax on the tokens at each scale and mix them into the final 16*16 token map, which then goes through the frozen decoder to produce the image, and I apply the image loss to that image. I think it's useless to try this part :) Hope my reply helps you.

AlexzQQQ avatar Mar 31 '25 15:03 AlexzQQQ