ViT-Lens
combining modalities
Hi, thanks for this amazing work.
Could you share demo code showing how to combine different modalities into a single image, as mentioned in the paper? The relevant passage reads:
"Moreover, the model demonstrates the capability to intake inputs from various modalities and subsequently generate an image that combines all the conveyed concepts in a coherent manner. In practice, we employ the prompt “[input tokens A], [input tokens B], please generate an image to combine them” to facilitate this process. For a visual examples, please refer to Fig. 6-(E) in the main paper."
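To make the question concrete, here is a minimal sketch of how I imagine the prompt text being assembled. The function name `build_combine_prompt` and the placeholder strings are my own assumptions, not part of the ViT-Lens code. What I am missing is how each placeholder gets replaced by the actual tokens produced for a given modality.

```python
def build_combine_prompt(modality_placeholders):
    """Join per-modality placeholders into the combination prompt quoted from the paper.

    `modality_placeholders` is a hypothetical list of strings that stand in for
    the token sequences of each input modality; it is not a ViT-Lens API.
    """
    if len(modality_placeholders) < 2:
        raise ValueError("need at least two modality inputs to combine")
    # Template text taken from the paper's description of the prompt.
    return ", ".join(modality_placeholders) + ", please generate an image to combine them"


prompt = build_combine_prompt(["[input tokens A]", "[input tokens B]"])
print(prompt)
```

Is this roughly the right shape, and if so, at which point are the placeholders swapped for the real input tokens?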
Thank you so much for your help! Have a great day!