
Implementation Details

nasosger opened this issue 10 months ago · 4 comments

Hi! First of all, thanks for sharing this incredible project. Regarding the implementation of the generative pretraining, I would like to ask for some clarifications. Your paper is very detailed, with emphasis on all the hyperparameters, but I think the following two things are missing.

  1. How is the prefix length sampled? Is it constant across each batch, or specific to each sample in the batch? Are there any minimum or maximum values for it?
  2. In the paper, you mention that the MLP head (which predicts the normalized pixel values) consists of 12 MLP blocks. Does this mean that each MLP block is similar to the MLP block in each transformer layer? Does the head include residual connections as well?

Thanks again for sharing! If you could clarify these things, it would be very helpful for other researchers who seek to build upon this nice work.

nasosger avatar Mar 16 '25 20:03 nasosger

Hi! Thank you for your interest. I'm not sure whether you are referring to AIMv1 or AIMv2.

  1. It is sampled uniformly from [1, seq_len - 1] (see the sketch after this comment).
  2. For AIMv1 it is a stack of MLPs with the hyperparameters described in the paper; yes, they have residual connections. For AIMv2 it is a shared multimodal transformer decoder (hyperparameters described in the paper).

DonkeyShot21 avatar Mar 17 '25 09:03 DonkeyShot21
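
For readers who want something concrete, here is a minimal PyTorch sketch (not taken from the ml-aim codebase) of building prefix-causal attention masks with a prefix length sampled uniformly from [1, seq_len - 1]. The function name is made up for illustration, and the per-sample sampling shown here is only confirmed later in this thread.

```python
import torch


def make_prefix_causal_masks(batch_size: int, seq_len: int) -> torch.Tensor:
    """Build one prefix-causal attention mask per sample (illustrative sketch).

    The prefix length K is drawn uniformly from [1, seq_len - 1]. Queries
    inside the prefix attend bidirectionally within the prefix; queries after
    it attend causally, and the autoregressive loss would be computed only on
    the non-prefix positions.
    """
    # One prefix length per sample, uniform over {1, ..., seq_len - 1}
    # (whether it is per-sample or per-batch is clarified later in the thread).
    prefix_len = torch.randint(1, seq_len, (batch_size,))             # (B,)

    # Standard causal (lower-triangular) mask, shared by all samples.
    causal = torch.ones(seq_len, seq_len).tril().bool()               # (T, T)

    # True where the key position falls inside the sample's prefix.
    positions = torch.arange(seq_len)
    in_prefix = positions[None, None, :] < prefix_len[:, None, None]  # (B, 1, T)

    # (B, T, T): query i may attend to key j if j <= i or j is in the prefix.
    return causal[None, :, :] | in_prefix


masks = make_prefix_causal_masks(batch_size=64, seq_len=256)
print(masks.shape)  # torch.Size([64, 256, 256])
```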

Thank you very much for the reply, @DonkeyShot21! I was referring to AIMv1; my apologies for not stating it explicitly.

  1. Perfect! Do you sample a prefix length independently for each sample in the batch, or is the same prefix length shared by all samples in the batch?
  2. Forgive me if I am wrong, but in the paper the only things stated are the total number of MLP blocks and the expansion factor. Are the MLPs the same in architecture as the ones in the transformer layers (e.g. the activation functions or the number of layers)?

I appreciate your help, thank you!

nasosger avatar Mar 17 '25 10:03 nasosger

Hi @nasosger.

  1. We sample a prefix length independently for each sample in the batch (e.g. a batch size of 64 will result in 64 different masks).
  2. Yes, the MLPs in the decoder are identical in design to the ones in the main transformer trunk (see the sketch after this comment).

aelnouby avatar Mar 17 '25 10:03 aelnouby
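
Putting the two answers together, below is a hedged PyTorch sketch of what the 12-block AIMv1 pixel-prediction head could look like: residual MLP blocks with the same design as the trunk MLPs. The expansion factor of 4, GELU activation, pre-norm placement, and the final linear projection to patch pixels are assumptions based on standard ViT-style blocks, not details confirmed in this thread or taken from the released code.

```python
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """One residual MLP block mirroring a ViT trunk MLP (pre-norm, hidden
    expansion, GELU). Pre-norm placement and GELU are assumptions here."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * expansion)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expansion, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the MLP, as confirmed above.
        return x + self.fc2(self.act(self.fc1(self.norm(x))))


class PixelHead(nn.Module):
    """Stack of MLP blocks followed by a linear projection to the normalized
    pixel values of one patch (patch_size**2 * 3 outputs); the final
    projection is an assumption, not something stated in the thread."""

    def __init__(self, dim: int, depth: int = 12, patch_size: int = 16):
        super().__init__()
        self.blocks = nn.Sequential(*[MLPBlock(dim) for _ in range(depth)])
        self.proj = nn.Linear(dim, patch_size * patch_size * 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.blocks(x))


head = PixelHead(dim=1024, depth=12, patch_size=16)
out = head(torch.randn(2, 256, 1024))  # (batch, tokens, embedding dim)
print(out.shape)                       # torch.Size([2, 256, 768])
```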

Thank you guys! Much appreciated. Are there any plans to release the pretraining code as well?

nasosger avatar Mar 17 '25 15:03 nasosger