Why not adopt a BERT-like bidirectional Transformer, as in MaskGIT, to reconstruct tokens?
Dear author, thanks for sharing the code. I am greatly interested in your work and have a question. In the second stage, you adopt an encoder-decoder Transformer to reconstruct tokens. Why not directly adopt the bidirectional Transformer from MaskGIT? I would like to know what the advantages of the encoder-decoder Transformer are.
Waiting for your reply!
In fact, we started from MaskGIT's BERT architecture, but we found that both linear probing and unconditional generation performance were poor (57.4% accuracy, 20.7 FID). We then found that adopting an encoder-decoder architecture similar to MAE largely improves performance. My assumption is that the encoder-decoder design is better for representation learning, and that a good representation in turn also helps generation.
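The difference between the two token flows can be sketched in a few lines. This is a minimal numpy illustration of the shapes involved, not the actual model code; all names (`mask_embed`, `enc_out`, the mask ratio, and the dimensions) are placeholders chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, dim, mask_ratio = 256, 768, 0.75
tokens = rng.normal(size=(seq_len, dim))      # placeholder token embeddings
mask = rng.random(seq_len) < mask_ratio       # True = masked position

# BERT-style (MaskGIT): one bidirectional stack sees the full sequence,
# with masked positions replaced by a learned [MASK] embedding.
mask_embed = np.zeros(dim)                    # stand-in for a learned embedding
bert_input = np.where(mask[:, None], mask_embed, tokens)

# MAE-style encoder-decoder: the encoder sees ONLY the visible tokens...
visible = tokens[~mask]
enc_out = visible                             # stand-in for the encoder forward

# ...and a lightweight decoder re-inserts [MASK] embeddings at the masked
# positions before reconstructing the full token sequence.
dec_input = np.empty((seq_len, dim))
dec_input[~mask] = enc_out
dec_input[mask] = mask_embed

print(visible.shape[0], "visible tokens out of", seq_len)
```

Note that the encoder in the MAE-style design never attends to `[MASK]` embeddings at all, which is one plausible reason its representations differ from the BERT-style variant.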
Thanks for your reply! But I have another question: in the second stage, would better results be obtained if masked images were adopted as input to reconstruct the tokens? Table 4 of your paper shows that operating on pixels leads to better performance.
We must use image tokens as both input and output to enable image generation, because generation proceeds over multiple steps: in the middle of generation, only part of the tokens have been produced, and a partial set of tokens cannot be decoded into an image. If we only considered representation learning, using masked images as input and tokens as output would be similar to BEiT.
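The multi-step point above can be made concrete with a toy MaskGIT-style decoding loop. This is a hedged sketch: the "model" is random logits, and the keep schedule (`n_keep`) is a simple linear rule invented for illustration, not the paper's actual cosine schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab, steps = 64, 1024, 4

tokens = np.full(seq_len, -1)                 # -1 marks a not-yet-generated token
for step in range(steps):
    masked = tokens == -1
    # Stand-in for the Transformer: random logits over the codebook
    # for every still-masked position.
    logits = rng.normal(size=(int(masked.sum()), vocab))
    preds = logits.argmax(axis=1)
    conf = logits.max(axis=1)
    # Keep only the most confident predictions this step and re-mask the
    # rest, so every intermediate state still contains ungenerated tokens.
    n_keep = max(1, int(masked.sum()) // (steps - step))
    keep = np.argsort(-conf)[:n_keep]
    idx = np.flatnonzero(masked)[keep]
    tokens[idx] = preds[keep]
    # Mid-loop, `tokens` is partial and cannot be decoded to pixels;
    # only after the final step is the full token map available.

print("remaining masked:", int((tokens == -1).sum()))  # → 0 after the last step
```

Because the intermediate states are always token maps with holes, the model's input and output must live in token space; a pixel-space input would require decoding a partial token map at every step, which the tokenizer cannot do.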
I got it! Thanks!