ViT-Lens

combining modalities

Open bakachan19 opened this issue 10 months ago • 0 comments

Hi, thanks for this amazing work.

Could you share demo code showing how to combine different modalities into a single image, as mentioned in the paper?

"Moreover, the model demonstrates the capability to intake inputs from various modalities and subsequently generate an image that combines all the conveyed concepts in a coherent manner. In practice, we employ the prompt “[input tokens A], [input tokens B], please generate an image to combine them” to facilitate this process. For visual examples, please refer to Fig. 6-(E) in the main paper."

Thank you so much for your help! Have a great day!

bakachan19, Mar 26 '25 13:03