T2I-Adapter
About style representation
Hi! Thanks for the great work :)
I wonder what type of representation is used when training the style adapter. Is it the CLIP image embedding? If so, how do you make sure the content (semantics) is disentangled from the style? Thanks in advance!
Yes, we use the tokens output by the vision encoder of CLIP as the condition. The reason this does not introduce content information is that the content of the entire image is compressed into a small number of tokens, and the capacity of those tokens is not sufficient to store the content details of the whole image. The result is therefore a global representation that mainly carries style information.
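To make the bottleneck argument concrete, here is a minimal NumPy sketch (not the authors' implementation) of conditioning via cross-attention on a handful of CLIP-style vision tokens. The shapes are illustrative assumptions: 8 style tokens of dimension 768 serving as keys/values for 64 spatial query features. Because the output is a mixture of only 8 vectors, its rank is at most 8, which is the "limited capacity" intuition described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_cross_attention(queries, style_tokens):
    """Single-head cross-attention: UNet features attend to style tokens.

    queries:      (L, d) spatial features from the diffusion UNet (assumed)
    style_tokens: (N, d) global tokens from a CLIP-like vision encoder (assumed)
    """
    d = queries.shape[-1]
    attn = softmax(queries @ style_tokens.T / np.sqrt(d), axis=-1)
    return attn @ style_tokens  # each output row is a convex mix of N tokens

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 768))  # 64 query positions (hypothetical)
s = rng.standard_normal((8, 768))   # only 8 style tokens: the bottleneck
out = style_cross_attention(q, s)

print(out.shape)                      # (64, 768)
print(np.linalg.matrix_rank(out))     # <= 8: too low-rank to carry per-pixel content
```

The key point of the sketch is the last line: whatever the queries are, the conditioning signal lives in the span of the few style tokens, so fine-grained spatial content cannot be reconstructed from it; only global statistics such as style survive.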