VideoChat2-text baseline
Hello,
Thank you for putting out this amazing set of models, datasets, and evals! Would it be possible to release the code and details for the VideoChat2-text baseline from your paper? I am studying some properties of video-understanding benchmarks, and this baseline may be important for my analysis!
Best, Benno
Good question! In my experiments, I simply fed a zero image for VideoChat2-text. For example, change this code in demo.ipynb:
# img_list.append(image_emb)                 # original image
img_list.append(torch.zeros_like(image_emb)) # zero image
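For anyone else trying to reproduce this, here is a minimal sketch of the idea, assuming `image_emb` is the video embedding produced by the visual encoder in demo.ipynb; the helper function and its name are illustrative, not part of the repo:

```python
import torch

def build_img_list(image_emb: torch.Tensor, blank_video: bool = False) -> list:
    """Return the visual-embedding list that is fed to the LLM.

    With blank_video=True, the visual tokens are replaced by zeros of the
    same shape, so any answer must come from the LLM's text-only capacity
    (the VideoChat2-text baseline); otherwise the real embedding is used.
    """
    if blank_video:
        return [torch.zeros_like(image_emb)]  # blank "video": all-zero tokens
    return [image_emb]                        # normal VideoChat2 input
```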
Thank you! I will try that out
@Andy1621 In the paper you say: "“VideoChat2text” denotes the model receiving blank videos and excludes LoRA tuning, relying solely on the LLM’s capacity for responses". Does that mean you ran the Stage 2 model and not Stage 3? Do you have any more details on whether the rest of the setup, such as the prompting, was the same?
I ran both the Stage 2 and Stage 3 models in our pipeline with zeroed-out video input, but unfortunately the results look quite different from the paper. We were able to reproduce the normal (non-blank) results to within a 1-2% margin, however.
We would be very grateful if you could share any more details!
Hi! Actually, we ran the Stage 3 model, which was trained without LoRA, for a fair comparison.
Interesting, thank you! Do you still have the weights somewhere for this?
Please try this model without LoRA. I just found it among the previous model weights and haven't tested it~
Hi, we will close this issue.
Feel free to contact us if you have other questions.