
[Question] Janus text-to-image training code

Open miyapeng opened this issue 9 months ago • 12 comments

Required prerequisites

Questions

I am using the Janus text-to-image training code and have some questions about the details of the code.

First, I used the Janus code named Align_Anything_Janus from your recommended repo, and in the file ../Align_Anything_Janus/janus/models/modeling_vlm.py I found this training code:

elif task == "generation":
            image_token_num_per_image = 576
            cfg_weight = 5
            temperature = 1
            tokens = torch.zeros((2*input_ids.size(0), input_ids.size(1)), dtype=torch.int).cuda()
            for i in range(2):
                tokens[i*input_ids.size(0):(i+1)*input_ids.size(0), :] = input_ids
                if i % 2 != 0:
                    tokens[i*input_ids.size(0):(i+1)*input_ids.size(0), 1:-1] = 100015 # pad_id

            inputs_embeds = self.language_model.get_input_embeddings()(tokens)
            print("Embedding size:", self.language_model.get_input_embeddings().weight.size(0))
            print("Max token id in input_ids:", input_ids.max())
            outputs = self.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=None)
            hidden_states = outputs.last_hidden_state
            logits = self.gen_head(hidden_states)
            logits_cond = logits[0::2, :]
            logits_uncond = logits[1::2, :]

            all_logits = logits_uncond + cfg_weight * (logits_cond - logits_uncond)

Here, input_ids contains both text and image token ids, but you seem to process the image token ids with the text embedding table in inputs_embeds = self.language_model.get_input_embeddings()(tokens). I want to know why; I think it should use the mmgpt.prepare_gen_img_embeds function provided by Janus. Maybe I am wrong, but I really want to know why the tokens are handled this way and why self.language_model.get_input_embeddings()(tokens) is run just once for all of them.
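(For context, the doubled batch itself makes sense to me as classifier-free guidance. A rough sketch of just that part, with a hypothetical helper name, would be:)

import torch

def build_cfg_batch(input_ids: torch.LongTensor, pad_id: int) -> torch.LongTensor:
    """Hypothetical helper: pair every prompt with an unconditional copy.

    Even rows keep the full prompt (conditional); odd rows keep only the first
    and last tokens and pad out the rest (unconditional), which is the layout
    the logits[0::2] / logits[1::2] split expects.
    """
    tokens = input_ids.repeat_interleave(2, dim=0)  # [2 * bsz, seq_len]
    tokens[1::2, 1:-1] = pad_id                     # odd rows -> unconditional
    return tokens

# Guidance then pushes the logits away from the unconditional distribution:
# all_logits = logits_uncond + cfg_weight * (logits_cond - logits_uncond)

My question is specifically about the embedding lookup, not the CFG batching.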

miyapeng avatar Apr 21 '25 07:04 miyapeng

Also, I read the Janus paper, and it says an image adaptor is used to map token ids to embeddings. Here is the sentence:

After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook
embeddings corresponding to each ID into the input space of the LLM. We then concatenate
these feature sequences to form a multimodal feature sequence, which is subsequently fed into
the LLM for processing. 
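For reference, if I remember the official modeling_vlm.py correctly, this adaptor path is roughly:

# Sketch from memory of the official Janus repo; please check the source:
def prepare_gen_img_embeds(self, image_ids: torch.LongTensor) -> torch.Tensor:
    # gen_embed: codebook embedding lookup for the flattened image token ids
    # gen_aligner: the generation adaptor (MLP) into the LLM input space
    return self.gen_aligner(self.gen_embed(image_ids))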

And the training figure is as follows:

[Image: Janus training pipeline figure from the paper]

miyapeng avatar Apr 21 '25 13:04 miyapeng

Yes, you are right, I agree with you: mmgpt.prepare_gen_img_embeds should be used to embed the image token ids, and self.language_model.get_input_embeddings()(tokens) only for the text token ids.

hl0737 avatar Apr 21 '25 13:04 hl0737

Each should go through its respective adaptor, then the results are concatenated and passed to the LLM.
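Something like this sketch (untested; is_image is a hypothetical boolean mask marking which positions of tokens hold image ids, and it assumes those positions store raw VQ codebook indices):

text_embeds = mmgpt.language_model.get_input_embeddings()(tokens)  # text path
img_embeds = mmgpt.prepare_gen_img_embeds(tokens[is_image])        # image path

inputs_embeds = text_embeds.clone()
inputs_embeds[is_image] = img_embeds  # splice image embeddings into the sequence

outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True)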

hl0737 avatar Apr 21 '25 13:04 hl0737

Yes, but it seems the repo does not do this, and I really hope the authors can fix it.

miyapeng avatar Apr 21 '25 13:04 miyapeng

And, one more thing!!!

[Image: tokenizer vocabulary screenshot] 100015 is not the pad id, it is 100002 = =

hl0737 avatar Apr 21 '25 13:04 hl0737

Actually, when pre-tokenizing, we could save the VQ-encoded pixel_values, so here we would just use them instead of computing the embedding again.

hl0737 avatar Apr 21 '25 13:04 hl0737

In other words: text uses the tokenizer and embedding table to get embeddings, while images use the saved pixel_values directly.
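Roughly like this (hypothetical names, untested):

import torch

# Pre-tokenize once: run the VQ encoder over each training image and cache the
# code indices, so training never has to re-run the image encoder.
with torch.no_grad():
    # In the official repo, gen_vision_model.encode returns (quant, loss, info),
    # with the code indices inside info, if I remember right.
    _, _, info = mmgpt.gen_vision_model.encode(pixel_values)
    image_ids = info[2].reshape(pixel_values.shape[0], -1)  # [bsz, 576]

torch.save(image_ids, "cached_image_ids.pt")  # load these back during training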

hl0737 avatar Apr 21 '25 13:04 hl0737

@htlou could you please take a look at this issue? Thanks ~~~ Is my understanding correct? ~~

hl0737 avatar Apr 29 '25 15:04 hl0737

Sorry for the delay in my reply! I was busy fixing existing (and confirmed) issues in the current Janus implementation.

100015 is not the pad id, it is 100002 = =

According to the Hugging Face repo deepseek-ai/Janus-1.3B/blob/main/tokenizer_config.json, the pad token id IS 100015. I also checked the config of Janus-Pro, and its pad token id is also 100015. Perhaps you confused it with another model?
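You can also check it directly from the hub config:

from transformers import AutoTokenizer

# Reads tokenizer_config.json from the hub; expecting pad_token_id == 100015
tok = AutoTokenizer.from_pretrained("deepseek-ai/Janus-1.3B")
print(tok.pad_token, tok.pad_token_id)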

Here, input_ids contains both text and image token ids, but you seem to process the image token ids with the text embedding table in inputs_embeds = self.language_model.get_input_embeddings()(tokens). I want to know why; I think it should use the mmgpt.prepare_gen_img_embeds function provided by Janus.

I will look into this issue soon, and if there IS something wrong, I will add the fix to #197.

htlou avatar Apr 30 '25 05:04 htlou

@htlou thanks for your reply, and no need to apologize for the delay; I am very happy that you replied! (Just getting a reply already makes me happy! 🤣)

First question: the vocabularies of Janus-Pro-1B and Janus-Pro-7B are different. In Janus-Pro-1B the pad token id is 100002, but in Janus-Pro-7B it is 100015.

[Image: excerpt of the Janus-Pro-1B vocabulary]

Refer to this picture; it shows part of the Janus-Pro-1B vocabulary.

Second question: you can refer to the official inference code. When a new token is inferred, it goes through gen_embed and gen_aligner (a two-layer MLP) to get the image embedding; this is the prepare_gen_img_embeds function in modeling_vlm.py.
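Roughly this step, paraphrasing the official generation loop from memory (check the repo for the exact code):

# After computing all_logits with CFG, the official script samples the next
# image token and embeds it through prepare_gen_img_embeds, NOT through the
# text embedding table:
probs = torch.softmax(all_logits[:, -1, :] / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)          # [bsz, 1]
next_token = next_token.repeat_interleave(2, dim=0).view(-1)  # duplicate for CFG
inputs_embeds = mmgpt.prepare_gen_img_embeds(next_token).unsqueeze(1)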

hl0737 avatar Apr 30 '25 05:04 hl0737

First question: the vocabularies of Janus-Pro-1B and Janus-Pro-7B are different. In Janus-Pro-1B the pad token id is 100002, but in Janus-Pro-7B it is 100015.

Well, I only tested Janus-1.3B, Janus-7B, and Janus-Pro-7B on align-anything, and I ignored Janus-Pro-1B... Will incorporate this into #197.

htlou avatar Apr 30 '25 05:04 htlou

Well, I only tested Janus-1.3B, Janus-7B, and Janus-Pro-7B on align-anything, and I ignored Janus-Pro-1B... Will incorporate this into #197.

@htlou yeah! By the way, the Janus authors are from Peking University as well; do you have any connection with them? (Are you labmates from the same lab?) Janus has no official training code published, so I wonder if you could get hold of the original Janus training code, hhhhhh

hl0737 avatar Apr 30 '25 05:04 hl0737