Bin Lin (林彬)

15 issue results for Bin Lin (林彬)

Thank you for your excellent work. I want to know where I can download 'vit_base_patch8_384.pth'. Thanks again!

### Discussion

Hello, esteemed LLaVA developer, thank you for contributing such robust code and data to the community. We have extended LLaVA to [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA), which with just **3B sparsely activated...

- Project name: MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- Project links:
  - GitHub: https://github.com/PKU-YuanGroup/MoE-LLaVA
  - Paper: https://arxiv.org/abs/2401.15947
  - Demo: https://huggingface.co/spaces/LanguageBind/MoE-LLaVA
- Project description (within **100** characters): With only 3B sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on the object hallucination benchmark. With MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research on building more efficient and effective multi-modal learning systems. The MoE-LLaVA team has also released all data, code, and models.
- Project screenshots (up to **6**):

![intro0](https://github.com/GitHubDaily/GitHubDaily/assets/62638829/7fb8f027-6580-4155-b9b7-5e37bbc14350)
![intro](https://github.com/GitHubDaily/GitHubDaily/assets/62638829/67039139-9739-42e4-9f3f-742dc0d112c4)
![framework](https://github.com/GitHubDaily/GitHubDaily/assets/62638829/6de2348e-5cac-4282-9bf7-530dcfe20bb2)
![imagecli](https://github.com/GitHubDaily/GitHubDaily/assets/62638829/b9653552-63b2-4d79-931f-7e03f0a4f65e)
https://github.com/GitHubDaily/GitHubDaily/assets/62638829/586ae7ab-463a-403c-a4fb-2fd8f47d91bc
![moe-llava](https://github.com/GitHubDaily/GitHubDaily/assets/62638829/8c635772-8a22-41ce-867c-cf91b0387dba)
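For readers unfamiliar with what "sparsely activated parameters" means here, below is a minimal, illustrative sketch of a top-k routed mixture-of-experts feed-forward layer in PyTorch. The class name, dimensions, and expert count are assumptions for illustration only, not MoE-LLaVA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, hidden_dim=1024, ffn_dim=4096, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)  # token -> expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, hidden_dim)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # each token routed to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the top-k experts run for each token, so the number of parameters actually activated per forward pass stays far below the model's total parameter count.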

Project name: Video-LLaVA & LanguageBind: PKU's ChatLaw group open-sources a five-modality large model and a video large model, achieving new SOTA on video question answering!

Project links:
https://github.com/PKU-YuanGroup/LanguageBind
https://github.com/PKU-YuanGroup/Video-LLaVA

Project description (within 100 characters): The ChatGPT wave reflects people's expectations for artificial general intelligence (AGI) and has drawn wide attention across the industry. We have also built language models, represented by ChatLaw, which have been well received. However, since text-only language models cannot cover all AGI scenarios, we decided to keep working on multi-modal large models for images, video, audio, and more. We map the information from other modalities into text-like tokens through a few fully connected layers, so that the LLM can understand visual signals. We first proposed LanguageBind, a five-modality large model, and will open-source the five-modality dataset soon. We then bind the five modalities into the language space and train a video large model, Video-LLaVA. This framework lets a single LLM take both images and videos as input. It tops multiple leaderboards on video tasks, and the work shows that unifying the LLM's input improves its visual understanding. All code is open-sourced!

## 1. LanguageBind

![](https://raw.githubusercontent.com/PKU-YuanGroup/LanguageBind/main/assets/sota.jpg)
![](https://raw.githubusercontent.com/PKU-YuanGroup/LanguageBind/main/assets/languagebind_frame.jpg)
![](https://raw.githubusercontent.com/PKU-YuanGroup/LanguageBind/main/assets/languge_result.jpg)

## 2. Video-LLaVA

![](https://raw.githubusercontent.com/PKU-YuanGroup/Video-LLaVA/main/assets/main.jpg)
[](https://github-production-user-asset-6210df.s3.amazonaws.com/62638829/284110937-71ab15ac-105e-4b18-b0b5-e1b35d70607b.mp4)
![](https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/assets/video_llava_result.jpg?raw=true)
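The "few fully connected layers" that turn other modalities into text-like tokens can be pictured as a small projection MLP in front of the LLM. The sketch below is an assumption-level illustration (the class name `VisionToTokenProjector` and the dimensions are made up), not the actual Video-LLaVA or LanguageBind code.

```python
import torch
import torch.nn as nn

class VisionToTokenProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM's token embedding space.
    A sketch of the 'few fully connected layers' idea; dimensions are assumptions."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):
        # vision_feats: (batch, num_patches_or_frames, vision_dim)
        # returns pseudo text tokens: (batch, num_tokens, llm_dim)
        return self.proj(vision_feats)

# The projected visual tokens would then be concatenated with the text token
# embeddings before being fed to the language model, e.g.:
# inputs_embeds = torch.cat([projected_visual_tokens, text_embeds], dim=1)
```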

Wonderful work! But I'm confused about why the pooling token is used instead of the [CLS] token. Does performance get worse, or is there something I'm missing?
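For context, the difference between the two options is just which vector is read out of the ViT's output sequence. A minimal comparison, assuming a ViT-B/16-style output with the [CLS] token at position 0 (shapes are illustrative):

```python
import torch

def cls_feature(tokens):
    # tokens: (batch, 1 + num_patches, dim); position 0 is the [CLS] token
    return tokens[:, 0]

def mean_pooled_feature(tokens):
    # average over the patch tokens, ignoring [CLS]
    return tokens[:, 1:].mean(dim=1)

x = torch.randn(2, 1 + 196, 768)   # e.g. ViT-B/16 on a 224x224 image
print(cls_feature(x).shape, mean_pooled_feature(x).shape)  # both (2, 768)
```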

![image](https://user-images.githubusercontent.com/62638829/224203714-40745efa-c115-4ed0-9b5c-389f7507d6de.png) I have 1 video (already encoded offline as 16 images), 1 image (encoded as a .pth file), 1 title, and 8 captions, and I use a custom decode function to decode them....
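As a rough illustration of what such a custom decode step might look like, here is a hedged sketch; the directory layout and file names (`frame_XX.jpg`, `image.pth`, `title.txt`, `captions.txt`) are hypothetical, not the questioner's actual format.

```python
import torch
from PIL import Image

def decode_sample(sample_dir):
    """Hypothetical decode step for one record: 16 pre-extracted video frames,
    one image tensor saved as .pth, a title string, and 8 caption strings."""
    frames = [Image.open(f"{sample_dir}/frame_{i:02d}.jpg").convert("RGB") for i in range(16)]
    image = torch.load(f"{sample_dir}/image.pth")            # pre-encoded image tensor
    with open(f"{sample_dir}/title.txt") as f:
        title = f.read().strip()
    with open(f"{sample_dir}/captions.txt") as f:
        captions = [line.strip() for line in f][:8]
    return {"frames": frames, "image": image, "title": title, "captions": captions}
```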

faq

I am new to RLHF, so will you upload demo code for RLHF, or slides about RLHF?

enhancement

I want to mask a specific part of the image and then reconstruct it, rather than masking patches randomly.
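One way to do this, assuming an MAE-style model that accepts an explicit per-patch boolean mask, is to build the mask deterministically from a chosen image region instead of sampling patches at random. The function name and box coordinates below are illustrative only.

```python
import torch

def region_patch_mask(img_size=224, patch_size=16, box=(64, 64, 160, 160)):
    """Build a deterministic patch mask covering a chosen region (x0, y0, x1, y1);
    True = masked. Assumes the model takes a flat per-patch mask as input."""
    n = img_size // patch_size                      # patches per side, e.g. 14
    mask = torch.zeros(n, n, dtype=torch.bool)
    x0, y0, x1, y1 = box
    mask[y0 // patch_size : (y1 + patch_size - 1) // patch_size,
         x0 // patch_size : (x1 + patch_size - 1) // patch_size] = True
    return mask.flatten()                           # (n*n,) mask over the patch sequence

mask = region_patch_mask()
print(mask.sum().item(), "of", mask.numel(), "patches masked")
```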

In Appendix A.2 it is mentioned that the class label is concatenated as another input to the padded feature. I would like to ask how to encode from a text...
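One plausible reading (an assumption on my part, since the appendix text is not reproduced here) is that the class label is turned into an embedding vector and appended to the padded feature sequence as one extra token. Below is a minimal sketch with a learned label embedding; a text encoder (e.g., CLIP's) could be substituted if the label needs to be encoded from its text form. The class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LabelConditionedFeature(nn.Module):
    """Appends a class-label embedding to a padded feature sequence (illustrative)."""

    def __init__(self, num_classes=1000, feat_dim=768):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, feat_dim)

    def forward(self, padded_feats, class_ids):
        # padded_feats: (batch, seq_len, feat_dim); class_ids: (batch,)
        label_tok = self.label_embed(class_ids).unsqueeze(1)   # (batch, 1, feat_dim)
        return torch.cat([padded_feats, label_tok], dim=1)     # (batch, seq_len + 1, feat_dim)
```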