Question about training

HongyanZhi opened this issue · 2 comments

Thanks for your great work first! I found that the code uses `emu_encoder.decoder.lm.generate()` to produce the text response and `emu_encoder.decoder.lm.model()` to produce the latent image embeddings. How can I output both the text and the image embeddings to reproduce your training process? Or does training first use `emu_encoder.decoder.lm.generate()` to generate the text and then `emu_encoder.decoder.lm.model()` to generate the image embeddings? Thanks for your reply!

HongyanZhi · Sep 21 '23

Hello, and thanks for your interest in our work. For each training example, we generate the embeddings only once. Note that for the text loss we also first generate the embeddings and then compute the classification (cross-entropy) loss. The image loss is computed in the same place, but with a regression objective instead of the classification one.
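
To make the single-pass setup concrete, here is a minimal sketch of how one forward pass can feed both objectives (the helper names follow the Hugging Face LLaMA layout, i.e. `lm.model` / `lm.lm_head`, and `image_token_mask` / `image_targets` are hypothetical; the actual Emu code organizes this differently):

```python
import torch
import torch.nn.functional as F

def joint_loss(lm, input_ids, text_labels, image_targets, image_token_mask):
    # Single forward pass through the decoder: one set of hidden
    # states is reused for both objectives.
    hidden = lm.model(input_ids=input_ids).last_hidden_state

    # Text loss: project the hidden states to vocabulary logits and
    # apply cross-entropy; positions without a text label carry -100.
    logits = lm.lm_head(hidden)
    text_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )

    # Image loss: at the visual-token positions, regress the hidden
    # states onto the target image embeddings instead of classifying.
    image_loss = F.mse_loss(hidden[image_token_mask], image_targets)

    return text_loss + image_loss
```

The key point is that `hidden` is computed once and then consumed twice: by a classification head for text and by a regression objective for the image embeddings.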

yqy2001 · Oct 13 '23

Thanks for your reply!
I have two more questions:

  1. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/pipeline.py#L115C29-L115C29 . If I use classifier-free guidance, shouldn't noise_pred_uncond be produced by a forward pass without encoder_hidden_states? That is not what the current code does (see the CFG sketch after this list).
  2. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/modeling_llama.py#L234 Why do the labels have 33 tokens instead of the 32 tokens stated in the paper? (A general note on off-by-one label counts follows after this list.)
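
For reference, classifier-free guidance in Stable Diffusion-style pipelines is usually implemented along these lines (a sketch with illustrative names such as `unet`, `cond_embeddings`, and `uncond_embeddings`, not the actual Emu pipeline code). Note that the unconditional branch typically still passes `encoder_hidden_states`, just computed from an empty ("null") prompt, rather than dropping the argument:

```python
def cfg_noise_pred(unet, latents, t, cond_embeddings, uncond_embeddings,
                   guidance_scale=7.5):
    """One classifier-free-guidance denoising step (illustrative)."""
    # Conditional branch: conditioning embeddings enter via cross-attention.
    noise_cond = unet(latents, t, encoder_hidden_states=cond_embeddings).sample
    # Unconditional branch: embeddings of an empty prompt are still passed.
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_embeddings).sample
    # Extrapolate away from the unconditional prediction.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```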
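
On the 33-vs-32 count, one common source of such an off-by-one (stated as a general causal-LM pattern, not a claim about this repo) is the next-token shift: supervising N targets also requires the position immediately before the first of them, so 32 targets plus their boundary token occupy 33 label positions:

```python
# Generic causal-LM label shift: each position predicts the *next*
# token, so 32 supervised targets plus the token that precedes them
# span 33 positions before the shift is applied.
inputs = tokens[:, :-1]   # what the model sees
labels = tokens[:, 1:]    # what it must predict
```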

Many thanks for your reply!

Hoyyyaard · Oct 19 '23