Question about training

HongyanZhi opened this issue · 2 comments

Thanks for your great work first! I found that the code uses `emu_encoder.decoder.lm.generate()` to produce the text response and `emu_encoder.decoder.lm.model()` to produce the latent image embeddings. How can I output both the text and the image embeddings to reproduce your training process? Or does training first use `emu_encoder.decoder.lm.generate()` to generate the text and then `emu_encoder.decoder.lm.model()` to generate the image embeddings? Thanks for your reply!

HongyanZhi · Sep 21 '23

Hello, and thanks for your interest in our work. For each training example, we generate the embeddings only once. Note that for the text loss we also first generate the embeddings and then compute the classification (cross-entropy) loss. The image loss is computed in the same place, but with a regression objective instead of the classification one.
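
To make the single-pass setup concrete, here is a minimal sketch of how one forward pass can feed both objectives (the helper names follow the Hugging Face LLaMA layout, i.e. `lm.model` / `lm.lm_head`, and `image_token_mask` / `image_targets` are hypothetical; the actual Emu code organizes this differently):

```python
import torch
import torch.nn.functional as F

def joint_loss(lm, input_ids, text_labels, image_targets, image_token_mask):
    # Single forward pass through the decoder: one set of hidden
    # states is reused for both objectives.
    hidden = lm.model(input_ids=input_ids).last_hidden_state

    # Text loss: project the hidden states to vocabulary logits and
    # apply cross-entropy; positions without a text label carry -100.
    logits = lm.lm_head(hidden)
    text_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )

    # Image loss: at the visual-token positions, regress the hidden
    # states onto the target image embeddings instead of classifying.
    image_loss = F.mse_loss(hidden[image_token_mask], image_targets)

    return text_loss + image_loss
```

The key point is that `hidden` is computed once and then consumed twice: by a classification head for text and by a regression objective for the image embeddings.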

yqy2001 · Oct 13 '23

Thanks for your reply!
I have two more questions:

  1. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/pipeline.py#L115C29-L115C29 . If I use classifier-free guidance, shouldn't noise_pred_uncond be produced by a forward pass without encoder_hidden_states? That is not what the current code does (see the CFG sketch after this list).
  2. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/modeling_llama.py#L234 Why do the labels have 33 tokens instead of the 32 tokens stated in the paper? (A general note on off-by-one label counts follows after this list.)
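
For reference, classifier-free guidance in Stable Diffusion-style pipelines is usually implemented along these lines (a sketch with illustrative names such as `unet`, `cond_embeddings`, and `uncond_embeddings`, not the actual Emu pipeline code). Note that the unconditional branch typically still passes `encoder_hidden_states`, just computed from an empty ("null") prompt, rather than dropping the argument:

```python
def cfg_noise_pred(unet, latents, t, cond_embeddings, uncond_embeddings,
                   guidance_scale=7.5):
    """One classifier-free-guidance denoising step (illustrative)."""
    # Conditional branch: conditioning embeddings enter via cross-attention.
    noise_cond = unet(latents, t, encoder_hidden_states=cond_embeddings).sample
    # Unconditional branch: embeddings of an empty prompt are still passed.
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_embeddings).sample
    # Extrapolate away from the unconditional prediction.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```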
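
On the 33-vs-32 count, one common source of such an off-by-one (stated as a general causal-LM pattern, not a claim about this repo) is the next-token shift: supervising N targets also requires the position immediately before the first of them, so 32 targets plus their boundary token occupy 33 label positions:

```python
# Generic causal-LM label shift: each position predicts the *next*
# token, so 32 supervised targets plus the token that precedes them
# span 33 positions before the shift is applied.
inputs = tokens[:, :-1]   # what the model sees
labels = tokens[:, 1:]    # what it must predict
```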

Many thanks for your reply!

Hoyyyaard · Oct 19 '23