Stan Lei
You may find [ViT-Lens](https://github.com/TencentARC/ViT-Lens) of interest; it works with MLLMs to generate text or images from other modalities :)
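For context, a minimal sketch of the general idea of wiring a non-image encoder into a frozen MLLM. This is illustrative only, not ViT-Lens's actual API; `ModalityToMLLM`, `modality_encoder`, and the dimension names are all assumptions here.

```python
# Illustrative sketch only -- not ViT-Lens's real API. The idea: encode a
# non-image modality (depth/audio/3D) into features, then project them into
# the token-embedding space of a frozen MLLM so it can condition on them.
import torch
import torch.nn as nn

class ModalityToMLLM(nn.Module):
    def __init__(self, modality_encoder: nn.Module, enc_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = modality_encoder              # e.g. a ViT adapted to the modality
        self.project = nn.Linear(enc_dim, llm_dim)   # map into the MLLM token space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)                      # (batch, seq, enc_dim)
        return self.project(feats)                   # (batch, seq, llm_dim) soft prompts
```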
Not sure if it is because the dog/car/bird cases do not appear in ImageBind's training set.
Hi there, I recommend checking out our project [ViT-Lens](https://github.com/TencentARC/ViT-Lens). In our depth experiments we obtained better performance than ImageBind on the same test data. Hope that helps.
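For reference, a minimal sketch of the CLIP-style zero-shot protocol such depth comparisons typically use; the embedding inputs stand in for the models' actual encoder outputs, and none of the names below are ViT-Lens's real function names.

```python
# Minimal sketch of CLIP-style zero-shot evaluation: classify each depth
# embedding by its nearest class-prompt text embedding under cosine similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(depth_embeds: torch.Tensor,      # (N, D) from a depth encoder
                       class_text_embeds: torch.Tensor, # (C, D) from a text encoder
                       labels: torch.Tensor) -> float:  # (N,) ground-truth class ids
    d = F.normalize(depth_embeds, dim=-1)    # L2-normalize so dot product = cosine
    t = F.normalize(class_text_embeds, dim=-1)
    preds = (d @ t.T).argmax(dim=-1)         # nearest class prompt per sample
    return (preds == labels).float().mean().item()
```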
Hi there, I recommend checking out our project [ViT-Lens](https://github.com/TencentARC/ViT-Lens). We open-sourced the training code; you may take a look at the audio part for your customized application.
Thank you for pointing this out -- it is important to figure this out for a more general depth model. Could you please also check [LanguageBind](https://github.com/PKU-YuanGroup/LanguageBind) and their...
Got it, thanks @jbrownkramer! I will look into this.
Hi, please check #11. For integration, we used the same ViT as in InstructBLIP/SEED for ViT-Lens training. FYI, this [ckpt](https://huggingface.co/TencentARC/ViT-Lens/blob/main/eva_g14_objaverse.pt) on Hugging Face is for 3D integration. I will upload ckpts for...
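One way to fetch and inspect that checkpoint, assuming only what the URL above states (repo `TencentARC/ViT-Lens`, file `eva_g14_objaverse.pt`); the state-dict layout is not documented in this thread, so verify the printed keys yourself.

```python
# Download the linked checkpoint from the Hugging Face Hub and peek at it.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="TencentARC/ViT-Lens", filename="eva_g14_objaverse.pt")
state = torch.load(path, map_location="cpu")
# If it is a state dict, show the first few keys; otherwise show the object itself.
print(type(state), list(state)[:5] if hasattr(state, "keys") else state)
```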
Thank you so much for pointing this out and for the insightful suggestion! I will look into it and experiment with the normalization you mentioned in the depth-related experiments, to...
@jbrownkramer Thank you for your comments. If possible, could you please share the implementation you mentioned, so that I can find some time later to run experiments on it?...
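In the meantime, a purely hypothetical stand-in: per-image min-max normalization of a depth map, one common way to reduce sensitivity to absolute depth scale. This is not the normalization proposed in the thread (which isn't quoted here), just a placeholder for experimentation.

```python
# Hypothetical baseline, not the commenter's suggested normalization:
# rescale each depth map to [0, 1], ignoring invalid zero pixels.
import torch

def normalize_depth(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map an (H, W) depth map to [0, 1] per image; zeros are treated as invalid."""
    valid = depth > 0
    if valid.any():
        d_min, d_max = depth[valid].min(), depth[valid].max()
        depth = torch.where(valid,
                            (depth - d_min) / (d_max - d_min + eps),
                            torch.zeros_like(depth))
    return depth
```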
Same issue here. Even with `--use-te-layernorm-linear` when converting CLIP, the error persists.