T2I-Adapter
About style representation
Hi! Thanks for the great work :)
I wonder what type of representation is used when training the style adapter. Is it the CLIP image embedding? If so, how do you make sure the content (semantics) is disentangled from the style? Thanks in advance!
Yes, we use the tokens output by the vision encoder of CLIP as the condition. The reason this does not introduce content information is that the content of the entire image is compressed into a small number of tokens, and the capacity of those tokens is not sufficient to store the content details of the whole image. The result is therefore a global representation that mainly carries style information.
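To make the bottleneck argument concrete, here is a minimal NumPy sketch (not the authors' implementation) of conditioning via cross-attention on a handful of CLIP-style vision tokens. The shapes are illustrative assumptions: 8 style tokens of dimension 768 serving as keys/values for 64 spatial query features. Because the output is a mixture of only 8 vectors, its rank is at most 8, which is the "limited capacity" intuition described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_cross_attention(queries, style_tokens):
    """Single-head cross-attention: UNet features attend to style tokens.

    queries:      (L, d) spatial features from the diffusion UNet (assumed)
    style_tokens: (N, d) global tokens from a CLIP-like vision encoder (assumed)
    """
    d = queries.shape[-1]
    attn = softmax(queries @ style_tokens.T / np.sqrt(d), axis=-1)
    return attn @ style_tokens  # each output row is a convex mix of N tokens

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 768))  # 64 query positions (hypothetical)
s = rng.standard_normal((8, 768))   # only 8 style tokens: the bottleneck
out = style_cross_attention(q, s)

print(out.shape)                      # (64, 768)
print(np.linalg.matrix_rank(out))     # <= 8: too low-rank to carry per-pixel content
```

The key point of the sketch is the last line: whatever the queries are, the conditioning signal lives in the span of the few style tokens, so fine-grained spatial content cannot be reconstructed from it; only global statistics such as style survive.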