Summer
When I run `python demo.py`, I hit the error below. Can anyone help? Traceback (most recent call last): File "demo.py", line 3, in from inpaint import inpaint File...
```
class CLIPTransformer(nn.Module):
    def __init__(self, config: Config):
        super(CLIPTransformer, self).__init__()
        self.config = config
        if self.config.huggingface:
            from transformers import CLIPModel
            self.clip = CLIPModel.from_pretrained(self.config.clip_type)  # downloading pytorch_model.bin 577M
        else:
            from model.clip_model import load_clip
            self.clip...
```
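For reference, a minimal sketch of the Hugging Face branch in isolation; `openai/clip-vit-base-patch32` is only a placeholder for whatever `config.clip_type` actually holds in this repo:

```
from transformers import CLIPModel

# Placeholder checkpoint id -- the repo's config.clip_type may point to a different CLIP variant.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Quick sanity check that the weights were downloaded and loaded.
print(sum(p.numel() for p in clip.parameters()), "parameters")
```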
Where can we download the raw videos of the TVR dataset?
Excuse me, and thank you for your excellent project. I have successfully reproduced it on two datasets, MSRVTT and MSVD. But when I reproduced it on the LSMDC dataset, the performance was...
Thank you for your interesting work and for sharing the code! I'm confused about whether the zero-shot performance on MSRVTT reported [here](https://github.com/OpenGVLab/InternVideo/tree/main/Downstream/Video-Text-Retrieval#our-results) requires setting “--mergeclip=True”. Below is the...
Hello, thank you for your excellent work on MixGen and for sharing the code. I have two questions I would like to ask for your help with: 1. In addition to being used...
Hi, I am new to this field and very interested in your excellent research. In your paper, you mentioned > _To compare the effect of template diversity on...
Hello, thank you for sharing. I have a question: why is the dataset provided in this repository, [MLVU-Dev](https://huggingface.co/datasets/MLVU/MVLU), different from the dataset used by lmms-eval ([sy1998/MLVU_dev](https://huggingface.co/datasets/sy1998/MLVU_dev/tree/main))? Is there any difference...
After installing the dependencies following the quickstart, I ran video inference. Script 1 works fine, but script 2 raises an error; the only difference between them is whether attn_implementation="flash_attention_2" is specified in Qwen2_5_VLForConditionalGeneration.from_pretrained. Script 1, which infers normally, is as follows:

```
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/home/uussee/MLLM/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"...
```
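For context, script 2 presumably differs only in the extra keyword argument shown below (a sketch based on the description above, not the exact script; the flash-attn package and a supported GPU are required for this path):

```
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Same loader call as script 1, plus attn_implementation="flash_attention_2"
# (the one change described above). This path needs flash-attn installed.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/home/uussee/MLLM/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```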
In realtime_inference.py, every frame appears to be processed independently. How is the continuity of mouth movements between frames ensured? I have noticed that the mouth position sometimes drifts.