[Question] Janus text-to-image training code
Required prerequisites
- [x] I have read the documentation https://align-anything.readthedocs.io.
- [x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Questions
I am using the Janus text-to-image training code and have some questions about the implementation details.
First, I am using the Janus code named Align_Anything_Janus from your recommended repo, and in the ../Align_Anything_Janus/janus/models/modeling_vlm.py file I found the following training code:
```python
elif task == "generation":
    image_token_num_per_image = 576
    cfg_weight = 5
    temperature = 1
    tokens = torch.zeros((2 * input_ids.size(0), input_ids.size(1)), dtype=torch.int).cuda()
    for i in range(2):
        tokens[i * input_ids.size(0):(i + 1) * input_ids.size(0), :] = input_ids
        if i % 2 != 0:
            tokens[i * input_ids.size(0):(i + 1) * input_ids.size(0), 1:-1] = 100015  # pad_id
    inputs_embeds = self.language_model.get_input_embeddings()(tokens)
    print("Embedding size:", self.language_model.get_input_embeddings().weight.size(0))
    print("Max token id in input_ids:", input_ids.max())
    outputs = self.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=None)
    hidden_states = outputs.last_hidden_state
    logits = self.gen_head(hidden_states)
    logits_cond = logits[0::2, :]
    logits_uncond = logits[1::2, :]
    all_logits = logits_uncond + cfg_weight * (logits_cond - logits_uncond)
```
For this, input_ids contains both text and image token ids, but you seem to process the image token ids with the text embedding layer in `inputs_embeds = self.language_model.get_input_embeddings()(tokens)`. I want to know why, and whether it should instead use the `mmgpt.prepare_gen_img_embeds` provided by Janus.
Maybe I am not right, but I really want to know why the tokens are handled this way, and why `self.language_model.get_input_embeddings()(tokens)` is run just once.
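For reference, the generation-side embedding helper in the official Janus repo looks roughly like this (a paraphrased sketch of `prepare_gen_img_embeds` from janus/models/modeling_vlm.py, not a verbatim copy):

```python
# Paraphrased sketch of Janus's generation-side embedding path:
def prepare_gen_img_embeds(self, image_ids: torch.LongTensor):
    # gen_embed looks up the VQ codebook embedding for each image token id;
    # gen_aligner (a two-layer MLP) projects it into the LLM's input space.
    return self.gen_aligner(self.gen_embed(image_ids))
```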
I also read the Janus paper; it says a generation adaptor is used to map the token ids to embeddings. Here is the relevant passage:
> After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing.
And here is the training diagram from the paper:
[figure: Janus training pipeline diagram]
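Here is a minimal sketch of what I understand the paper to describe: text ids go through the LLM embedding table, image ids go through the generation adaptor, and the two are concatenated into one sequence. The names `text_ids`, `image_ids`, and `mmgpt` are hypothetical placeholders, not the repo's actual variables:

```python
import torch

# Hypothetical sketch, assuming text_ids and image_ids are already split out
# of the full sequence and mmgpt is the loaded MultiModalityCausalLM.
text_embeds = mmgpt.language_model.get_input_embeddings()(text_ids)  # (B, T_text, D)
img_embeds = mmgpt.prepare_gen_img_embeds(image_ids)                 # (B, T_img, D)

# Concatenate into one multimodal feature sequence and feed it to the LLM,
# as the paper describes.
inputs_embeds = torch.cat([text_embeds, img_embeds], dim=1)
outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=False)
```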
Yes, you are right, I agree with you: `mmgpt.prepare_gen_img_embeds` should be used to embed the image token ids, and `self.language_model.get_input_embeddings()(tokens)` should be used only for the text token ids.
Each goes through its respective adaptor, and the results are concatenated and then passed to the LLM.
Yes, but it seems that the repo does not do this, and I really hope the authors can fix it.
And, note!!!
100015 is not the pad id, it is 100002.
Actually, when pre-tokenizing, we could save the VQ-encode results of the pixel_values, so here we would just reuse them instead of computing the embedding again (see the sketch below).
In other words: text uses the tokenizer and embedding table to get its embeddings, while images directly use the saved VQ outputs.
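A rough sketch of this idea, assuming `gen_vision_model.encode` follows the usual taming-transformers VQ interface that returns `(quant, emb_loss, info)` with the discrete code indices in `info[2]` (that signature is my assumption, please verify against the repo):

```python
import torch

@torch.no_grad()
def pretokenize_image(mmgpt, pixel_values: torch.Tensor) -> torch.Tensor:
    # Assumed taming-transformers-style encode: (quant, emb_loss, info),
    # with info[2] holding the code indices -- verify against the actual repo.
    _, _, info = mmgpt.gen_vision_model.encode(pixel_values)
    return info[2].reshape(pixel_values.size(0), -1)  # flattened 1-D id sequence

# Save the returned ids to disk once at pre-tokenization time; at training
# time, reuse them directly:
# img_embeds = mmgpt.prepare_gen_img_embeds(saved_image_ids)
```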
@htlou could you please take a look at this issue? Thanks! Is my understanding correct?
Sorry for the delay in my reply! I was busy fixing existing (and confirmed) issues in the current Janus implementation.
> 100015 is not the pad id, it is 100002.
According to the huggingface repo deepseek-ai/Janus-1.3B/blob/main/tokenizer_config.json, the pad token id IS 100015. I also checked the config of Janus-Pro, and the pad token id is also 100015. Perhaps you confused it with other models?
> For this, input_ids contains both text and image token ids, but you seem to process the image token ids with the text embedding layer in `inputs_embeds = self.language_model.get_input_embeddings()(tokens)`. I want to know why, and whether it should instead use the `mmgpt.prepare_gen_img_embeds` provided by Janus.
I will look into this issue soon, and if there IS something wrong, I will add the fix into #197.
@htlou thanks for your reply, and no need to apologize for the delay. I am already very happy that you replied! 🤣
First question: the vocabularies of Janus-Pro-1B and Janus-Pro-7B are different. In Janus-Pro-1B the pad token id is 100002, but in Janus-Pro-7B it is 100015.
Refer to this picture, which shows part of the Janus-Pro-1B vocabulary:
[screenshot: excerpt of the Janus-Pro-1B vocabulary]
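To double-check this, one could print the pad token of both checkpoints (assuming the tokenizers load via AutoTokenizer):

```python
from transformers import AutoTokenizer

for name in ["deepseek-ai/Janus-Pro-1B", "deepseek-ai/Janus-Pro-7B"]:
    tok = AutoTokenizer.from_pretrained(name)
    # Expected per this thread: 100002 for Janus-Pro-1B, 100015 for Janus-Pro-7B.
    print(name, tok.pad_token, tok.pad_token_id)
```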
Second question: you can refer to the official inference code. When a new token is inferred, it goes through gen_embed and gen_aligner (a two-layer MLP) to get the image embedding; see the `prepare_gen_img_embeds` function in modeling_vlm.py.
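The relevant step of the official generation loop looks roughly like this (paraphrased from memory of Janus's generation_inference.py, not verbatim):

```python
# Inside the autoregressive image-generation loop (paraphrased sketch):
next_token = torch.multinomial(probs, num_samples=1)               # sample an image token id
next_token = torch.cat([next_token, next_token], dim=1).view(-1)   # duplicate for CFG cond/uncond rows
img_embeds = mmgpt.prepare_gen_img_embeds(next_token)              # gen_embed -> gen_aligner (MLP)
inputs_embeds = img_embeds.unsqueeze(dim=1)                        # fed back as the next step's input
```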
> First question: the vocabularies of Janus-Pro-1B and Janus-Pro-7B are different. In Janus-Pro-1B the pad token id is 100002, but in Janus-Pro-7B it is 100015.

Well, I only tested Janus-1.3B, Janus-7B, and Janus-Pro-7B on align-anything, and I overlooked Janus-Pro-1B... I will incorporate this into #197.
@htlou yeah! By the way, the Janus authors are from Peking University as well, do you have a connection with them? (Are you from the same lab?) Janus has no official training code published, so I wonder if you could get the original Janus training code, hahaha.