DAI

Results 20 comments of DAI

Hmm, it will work, but it would be useless: the model performs exactly the same step during inference, because the image has a fixed latent size.

But it will be useful if you want to train on images with different aspect ratios and add that information as the start token.

Yes. Another suggestion, based on my experience training many T2I models: adding cross-attention instead of prepending the token to the sequence produces more promising results.
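To make the two options concrete, here is a minimal sketch of the difference in plain Python. It is not the setup from the thread: the function names are hypothetical, the attention uses identity projections (no learned weights), and the prepend variant omits the causal mask an autoregressive model would apply.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def prepend_condition(image_tokens, cond_tokens):
    """Option A (assumed form): condition tokens go in front, and
    self-attention runs over the single, longer sequence."""
    seq = cond_tokens + image_tokens
    return attend(seq, seq, seq)


def cross_attend_condition(image_tokens, cond_tokens):
    """Option B (assumed form): image tokens are the queries, condition
    tokens are the keys and values; the sequence length is unchanged."""
    return attend(image_tokens, cond_tokens, cond_tokens)


image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cond_tokens = [[0.5, 0.5]]
prepended = prepend_condition(image_tokens, cond_tokens)
crossed = cross_attend_condition(image_tokens, cond_tokens)
```

With prepending, the condition takes up positions in the sequence and competes with image tokens inside the same attention; with cross-attention, every image token reads the condition directly through a separate attention step.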

The PE is there to tell the model about relative positions within the image; for example, a pixel should be more strongly related to nearby pixels. But in...

Oh, I see, so the text only takes effect via xv?

And does that also mean the text attention mask is useless?

If you only train the VQGAN, then obviously the VQGAN is trainable. If you train the GPT for image generation, then you only need to train the GPT model...

In my opinion, the autoregressive model is actually not very good for mask-guided image generation.

If the mask is just like a causal mask, I think it will work great, but I do not think we always have a causal mask in real life.
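One way to read the point about a causal mask, as a rough sketch: an autoregressive model generates tokens in a fixed (here assumed raster) order, so it can only condition on known tokens that come before the ones it has to fill in. The helper names below are hypothetical and the 4x4 token grid is only an illustration.

```python
def causal_mask(n):
    """n x n attention mask: position i may attend to j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def is_causal_compatible(known):
    """True if the known (unmasked) tokens form a prefix in raster order.

    `known` is a flat list of booleans over the token grid, read row by row.
    """
    first_unknown = known.index(False) if False in known else len(known)
    return not any(known[first_unknown:])


# 4x4 token grid with the top two rows known: the known part is a prefix,
# so generation is a plain continuation.
top_half = [True] * 8 + [False] * 8

# Hole in the middle of the grid: known tokens after the hole cannot be
# seen while the earlier masked tokens are being generated.
center_hole = [True] * 16
for idx in (5, 6, 9, 10):
    center_hole[idx] = False
```

Here `is_causal_compatible(top_half)` holds while `is_causal_compatible(center_hole)` does not, which matches the comment: masks met in practice are usually shaped like the second case.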