Query About Figure 3 in the MAGE Article
Hello, following your suggestion yesterday, I studied another one of your articles, MAGE, and I'm curious about something that may not have been clearly explained. Figure 3 shows the unmasked tokens and the fake class token after passing through the encoder, yet the article says that tokens are masked at a ratio between 0.5 and 1 and that only some of these masked tokens are dropped from the encoder input. Why is there this discrepancy? Is it a misunderstanding on my part, or is there an issue with the illustration in the article?
The implementation detail is as follows: during training, a masking ratio (mr) between 0.5 and 1 is sampled at each iteration to mask out the input image tokens. Since mr is always at least 0.5, we can always drop 50% of all tokens, taken from the masked tokens (this follows MAE and significantly reduces training cost). Therefore, the training input to the encoder consists of a fraction mr - 0.5 of masked tokens and a fraction 1 - mr of original (unmasked) tokens. During inference (generation), we do not drop any tokens; we start from 100% masked tokens and iteratively decode the image.
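For illustration, here is a minimal PyTorch sketch of that masking-and-dropping step, assuming tokens of shape (batch, length, dim) and a learnable mask token. This is not the official MAGE code, and sampling mr uniformly is a simplification; only the ratios (mask a fraction mr of the tokens, then drop a fixed 50% of all tokens from among the masked ones) follow the description above.

```python
import torch

def mask_and_drop(tokens, mask_token, drop_ratio=0.5):
    """Sketch of the training-time masking + dropping described above.

    tokens:     (B, L, D) input image tokens
    mask_token: (1, 1, D) learnable mask token
    Returns the encoder input of shape (B, (1 - drop_ratio) * L, D),
    containing (mr - drop_ratio) * L masked tokens and (1 - mr) * L
    original tokens, plus the sampled masking ratio mr.
    """
    B, L, D = tokens.shape
    # Sample a masking ratio in [0.5, 1] for this iteration (uniform here
    # for simplicity; the paper describes its own sampling scheme).
    mr = torch.empty(1).uniform_(0.5, 1.0).item()

    num_masked = int(mr * L)
    num_dropped = int(drop_ratio * L)  # always <= num_masked since mr >= 0.5

    # Random per-sample permutation of token positions.
    noise = torch.rand(B, L, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ranks = torch.argsort(ids_shuffle, dim=1)  # rank of each position

    # Replace the num_masked lowest-ranked positions with the mask token.
    is_masked = (ranks < num_masked).unsqueeze(-1)  # (B, L, 1)
    tokens = torch.where(is_masked, mask_token, tokens)

    # Drop the num_dropped lowest-ranked positions (all of them are masked,
    # because mr >= drop_ratio); keep the rest as the encoder input.
    ids_keep = ids_shuffle[:, num_dropped:]
    encoder_input = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )
    return encoder_input, mr
```

In this sketch the dropped tokens are exactly the 50% with the lowest shuffle rank, which are all masked by construction, so the kept sequence matches the mr - 0.5 masked / 1 - mr original split described above.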