Question about the paper: why does the text-prompt embedding use the penultimate-layer text embeddings of a CLIP ViT-H/14 text encoder?

Open LokiXun opened this issue 2 years ago • 0 comments

Hi, I am wondering why the prompt embedding in Stable Diffusion is extracted from the penultimate layer of the CLIP ViT-H/14 text encoder. Why not use the original CLIP feature, just like the image feature from the CLIP image encoder? This seems to cause a shape mismatch with the CLIP image feature, which makes it infeasible to simply replace the text embedding with a CLIP image embedding without modifying the model (is that true?). I am curious how to use an image as the condition for cross-attention without many changes to the model. Thanks!
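To make the shape question concrete, here is a minimal pure-Python sketch of single-head cross-attention. It is not the Stable Diffusion implementation: the widths (`D`, `D_CTX`), the random weights, and the helper names are all hypothetical stand-ins, and it assumes the pooled image embedding has the same channel width as the per-token text context. Under those assumptions, the context length (77 text tokens versus one pooled image vector treated as a length-1 sequence) does not change the output shape.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices stored as lists of rows: (n, k) @ (k, m) -> (n, m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, w_k, w_v):
    """Single-head cross-attention; queries (n_q, d), context (n_ctx, d_ctx)."""
    keys = matmul(context, w_k)                  # (n_ctx, d)
    values = matmul(context, w_v)                # (n_ctx, d)
    keys_t = [list(col) for col in zip(*keys)]   # (d, n_ctx)
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, keys_t)             # (n_q, n_ctx)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)               # (n_q, d)

def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N_Q, D, D_CTX = 6, 4, 8   # toy sizes, not the real model widths

queries = rand_matrix(N_Q, D)        # stand-in for projected latent positions
w_k = rand_matrix(D_CTX, D)          # key projection
w_v = rand_matrix(D_CTX, D)          # value projection

text_context = rand_matrix(77, D_CTX)   # per-token hidden states (77 = CLIP context length)
image_context = rand_matrix(1, D_CTX)   # one pooled vector as a length-1 sequence

out_text = cross_attention(queries, text_context, w_k, w_v)
out_image = cross_attention(queries, image_context, w_k, w_v)

print(len(out_text), len(out_text[0]))
print(len(out_image), len(out_image[0]))
```

One consequence visible in this sketch: with a single context token, the softmax weight is 1 for every query, so every latent position receives the same value vector. A single pooled image embedding therefore fits the interface but carries no per-position variation through cross-attention, which may be one reason to prefer a token sequence as the condition.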

Nov 04 '23 02:11 LokiXun