Summer
When I run `python demo.py`, I hit the error below. Can anyone help? Traceback (most recent call last): File "demo.py", line 3, in from inpaint import inpaint File...
```
class CLIPTransformer(nn.Module):
    def __init__(self, config: Config):
        super(CLIPTransformer, self).__init__()
        self.config = config
        if self.config.huggingface:
            from transformers import CLIPModel
            self.clip = CLIPModel.from_pretrained(self.config.clip_type)  # downloading pytorch_model.bin 577M
        else:
            from model.clip_model import load_clip
            self.clip...
```
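For reference, a minimal sketch of the Hugging Face branch in isolation; `openai/clip-vit-base-patch32` is only a placeholder for whatever `config.clip_type` actually holds in this repo:

```
from transformers import CLIPModel

# Placeholder checkpoint id -- the repo's config.clip_type may point to a different CLIP variant.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Quick sanity check that the weights were downloaded and loaded.
print(sum(p.numel() for p in clip.parameters()), "parameters")
```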
Where can we download the raw videos of the TVR dataset?
Excuse me, and thank you for your excellent project. I have successfully reproduced it on two datasets, MSRVTT and MSVD. But when I reproduced it on the LSMDC dataset, the performance was...
Thank you for your interesting work and for sharing the code! I'm confused about whether the zero-shot performance on MSRVTT reported [here](https://github.com/OpenGVLab/InternVideo/tree/main/Downstream/Video-Text-Retrieval#our-results) requires setting “--mergeclip=True”. Below is the...
Hello, thank you for your excellent work on MixGen and for sharing the code. I have two questions I would like to ask for your help with: 1. In addition to being used...
Hi, I am new to this field and very interested in your excellent research. In your paper, you mentioned > _To compare the effect of template diversity on...
Hello, thank you for sharing. I have a question: why is the dataset provided in this repository, [MLVU-Dev](https://huggingface.co/datasets/MLVU/MVLU), different from the dataset used by lmms-eval ([sy1998/MLVU_dev](https://huggingface.co/datasets/sy1998/MLVU_dev/tree/main))? Is there any difference...
After installing the dependencies following the quickstart, I ran video inference. Script 1 works fine, but script 2 raises an error; the only difference between them is whether attn_implementation="flash_attention_2" is specified in Qwen2_5_VLForConditionalGeneration.from_pretrained. Script 1, which infers normally, is as follows:

```
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/home/uussee/MLLM/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"...
```
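For context, script 2 presumably differs only in the extra keyword argument shown below (a sketch based on the description above, not the exact script; the flash-attn package and a supported GPU are required for this path):

```
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Same loader call as script 1, plus attn_implementation="flash_attention_2"
# (the one change described above). This path needs flash-attn installed.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/home/uussee/MLLM/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```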
In realtime_inference.py, every frame appears to be processed independently. How is the continuity of mouth movements between frames ensured? I have noticed that the mouth position sometimes drifts.