
Questions about VideoChat2_HD

Open LiJiaqi96 opened this issue 1 year ago • 35 comments

Hi, thanks for the VideoChat2_HD update! While trying the newly released code, I ran into some questions:

  • The MetaLoader_rs class in "train_it_ds.py" seems to be missing.
  • So I still used "train_it.py", but got the following error. I'm not sure whether it could be solved by using MetaLoader_rs.
RuntimeError: stack expects each tensor to be equal size, but got [8, 3, 224, 448] at entry 0 and [8, 3, 448, 672] at entry 1
  • Then I changed batch_size to 1, which resolved the previous error. But it seems the load_and_transform_media_data_image function does not accept the dynamic_config argument that "it_dataset_mistral.py" passes to it. I created a pull request to fix this part.
  • Is there any place to find the newly added dataset for VideoChat2_HD? I suppose the datasets are important to improve model performances.
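For context on the stack error above: PyTorch's default collate calls torch.stack on the per-sample tensors, which requires identical shapes, so variable-resolution HD clips in one batch fail. A minimal sketch of a custom collate_fn that pads each clip to the batch's max spatial size (the helper name and the [T, C, H, W] sample layout are illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def pad_collate(batch):
    """Pad variable-resolution clips to a common H x W so torch.stack works.

    Each sample is assumed to be a [T, C, H, W] tensor (e.g. 8 frames);
    this helper is a sketch, not part of the Ask-Anything repo.
    """
    max_h = max(x.shape[-2] for x in batch)
    max_w = max(x.shape[-1] for x in batch)
    padded = [
        # F.pad pads the last dims: (left, right, top, bottom); pad right/bottom only
        F.pad(x, (0, max_w - x.shape[-1], 0, max_h - x.shape[-2]))
        for x in batch
    ]
    return torch.stack(padded)

# The two shapes from the error message now stack fine:
batch = [torch.zeros(8, 3, 224, 448), torch.zeros(8, 3, 448, 672)]
print(pad_collate(batch).shape)  # torch.Size([2, 8, 3, 448, 672])
```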

LiJiaqi96 avatar Jun 12 '24 02:06 LiJiaqi96

Thanks for trying it out! I will fix it later~

Andy1621 avatar Jun 12 '24 03:06 Andy1621

@LiJiaqi96 Please have a try; I have updated the code. train_it_ds.py adds DeepSpeed support and needs some changes.

Andy1621 avatar Jun 12 '24 03:06 Andy1621

Thanks! I tried "train_it_ds.py" without using deepspeed, but it doesn't work. Is it possible to train without deepspeed? For now I would prefer not to use it.

LiJiaqi96 avatar Jun 12 '24 08:06 LiJiaqi96

Yes! You can run it without deepspeed. BTW, share your log so that I can fix the bug~

Andy1621 avatar Jun 12 '24 11:06 Andy1621

Sorry for the late reply. The log is here: train_log.txt. In "config_7b_hd_stage4.py" I set enable=False in the deepspeed settings,
and ran the code with:

torchrun    --nnodes=${NNODE} --nproc_per_node=${NUM_GPUS} \
    --rdzv_endpoint=${MASTER_NODE}:10068 \
    --rdzv_backend=c10d \
    tasks/train_it_ds.py \
    $(dirname $0)/config_7b_hd_stage4.py \
    output_dir ${OUTPUT_DIR}

LiJiaqi96 avatar Jun 13 '24 07:06 LiJiaqi96

I'm not sure whether it is caused by the deepspeed or pytorch versions. Here are my versions of the relevant packages:

torch                     1.13.1+cu117
torchaudio                0.13.1+cu117
torchnet                  0.0.4
torchvision               0.14.1+cu117
deepspeed                 0.14.2
transformers              4.40.1

BTW, sometimes you can fix the bug by changing find_unused_parameters to True or False.
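For reference, find_unused_parameters is a flag on PyTorch's DistributedDataParallel wrapper: True lets DDP tolerate parameters that receive no gradient in a step (common when parts of a model are frozen), while False is faster but errors out in that case. A minimal single-process sketch just to show where the flag goes (in the real scripts, torchrun sets the rendezvous environment variables for you):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup purely to illustrate the flag.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
# Toggling find_unused_parameters is a quick way to localize
# "expected to mark a variable ready only once"-style DDP bugs.
ddp_model = DDP(model, find_unused_parameters=True)

out = ddp_model(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])

dist.destroy_process_group()
```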

Andy1621 avatar Jun 13 '24 08:06 Andy1621

Thanks, I will create an environment with exactly the same packages and have a try.

LiJiaqi96 avatar Jun 13 '24 10:06 LiJiaqi96

Hi, I found that shared_utils_ds.py has a bug at line 58:

optimizer_params = create_optimizer(config.optimizer, model, return_group=True)

so optimizer.py may need to be updated as well.
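The repo's create_optimizer signature is project-specific, but a return_group=True path typically builds parameter groups (e.g. no weight decay for biases and norm scales) and returns the groups instead of an optimizer, so DeepSpeed can construct the optimizer itself. A generic sketch under those assumptions (names and defaults are illustrative, not the repo's actual API):

```python
import torch

def create_param_groups(model, weight_decay=0.02, lr=2e-5):
    """Split trainable parameters into decay / no-decay groups.

    Sketch of what a `return_group=True` code path usually returns, so a
    caller (e.g. deepspeed.initialize) can build the optimizer itself.
    """
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and 1-D params (norm scales) typically skip weight decay.
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay, "lr": lr},
        {"params": no_decay, "weight_decay": 0.0, "lr": lr},
    ]

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.LayerNorm(8))
groups = create_param_groups(model)
optimizer = torch.optim.AdamW(groups)  # or hand `groups` to DeepSpeed instead
```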

yuanrr avatar Jun 13 '24 12:06 yuanrr

Thanks for your feedback. I have updated the code.

Andy1621 avatar Jun 13 '24 20:06 Andy1621

I used the new environment except for flash-attn: since I use CUDA 12.1, I can only use flash-attn==2.1.0. I ran "scripts/videochat_mistral/run_7b_stage4_hd.sh" with "tasks/train_it.py" and deepspeed enable=False, and got this error: train_log0618.txt. The error seems to be caused by flash-attn.
Is it possible to run videochat2_hd using the same environment as videochat2_mistral, without using deepspeed?

LiJiaqi96 avatar Jun 18 '24 04:06 LiJiaqi96

BTW, I tested running the code on a single GPU (i.e., python train_it.py) and it iterates normally.

LiJiaqi96 avatar Jun 18 '24 09:06 LiJiaqi96

Yes, it's okay to use it without deepspeed. I use DeepSpeed ZeRO to decrease the GPU memory~
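For readers unfamiliar with it: ZeRO partitions optimizer states (stage 1) and gradients (stage 2) across GPUs instead of replicating them, which is where the memory savings come from. A generic sketch of the kind of JSON config DeepSpeed consumes (this is not the repo's actual config, whose deepspeed settings live in the Python config files):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```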

Andy1621 avatar Jun 18 '24 10:06 Andy1621

I see. Is it possible to run on multiple GPUs without deepspeed, just as the model runs in videochat2_mistral?

LiJiaqi96 avatar Jun 20 '24 01:06 LiJiaqi96

Update: I managed to solve the previous issue by upgrading flash-attn to 2.5.9. When I use "train_it_ds.py" with deepspeed enable=True, I hit a new issue with the deepspeed config: trainlog_0621.txt
Could you please help me solve it?

LiJiaqi96 avatar Jun 21 '24 10:06 LiJiaqi96

Hi! Please try again with the new commit.

Andy1621 avatar Jun 22 '24 18:06 Andy1621

Thanks for your update! The code now runs with deepspeed enabled.
BTW, is there any place to find the newly added datasets for VideoChat2_HD? I suppose the datasets are important for improving model performance.

LiJiaqi96 avatar Jun 24 '24 06:06 LiJiaqi96

Almost all the datasets can be downloaded directly from their repos or homepages~

Give me feedback if you don't find them.

Andy1621 avatar Jun 25 '24 11:06 Andy1621

In "instruction_data.py", there are some newly added image datasets from M3IT, and some newly added video datasets. Is there any place to find those video datasets? Thanks!

LiJiaqi96 avatar Jun 26 '24 06:06 LiJiaqi96

These datasets are generated from ShareGPTVideo, VidLN, FAVD and TimeIT_didemo.

Andy1621 avatar Jun 26 '24 07:06 Andy1621

Thanks for your sharing!

LiJiaqi96 avatar Jun 26 '24 09:06 LiJiaqi96

Another question: how can I obtain the checkpoint after VideoChat2_HD training? In "demo_mistral_hd.ipynb" it is loaded with
state_dict = torch.load("your_model_path/videochat2/videochat2_hd_mistral_stage4.pth", "cpu")
but I noticed that there are several files in the "ckpt_latest.pth" folder; should I choose one of them?
Thanks!

LiJiaqi96 avatar Jun 28 '24 02:06 LiJiaqi96

These datasets are generated from ShareGPTVideo, VidLN, FAVD and TimeIT_didemo.

Hi, could you please help me find the instruction JSON files, such as f"{anno_root_it}/video/caption/sharegptvideo/train_300k.json"? I did not find them in the HF VideoChat2-IT repo.

LiJiaqi96 avatar Jun 28 '24 07:06 LiJiaqi96

Sorry for the late reply. For the checkpoint, you need to use the file named mp_xxx, which saves the weights. For the instruction data, I will upload it today.
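In DeepSpeed checkpoint folders, the model-states files (e.g. mp_rank_00_model_states.pt) typically store the weights under a "module" key. A hedged sketch of pulling out the bare state dict for use with the notebook's torch.load path (the helper name and the fallback behavior are assumptions, not the repo's loading code):

```python
import torch

def extract_model_weights(ckpt_path):
    """Load a DeepSpeed model-states file and return the bare state dict.

    Falls back to returning the checkpoint as-is when there is no
    "module" key; sketch only, not the repo's actual loading code.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    return ckpt.get("module", ckpt) if isinstance(ckpt, dict) else ckpt

# e.g. state_dict = extract_model_weights(
#     "your_model_path/ckpt_latest.pth/mp_rank_00_model_states.pt")
```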

Andy1621 avatar Jun 28 '24 23:06 Andy1621

@LiJiaqi96 Please check the data in HuggingFace~

Andy1621 avatar Jun 29 '24 04:06 Andy1621

Thanks for your reply! I will try it~

LiJiaqi96 avatar Jun 30 '24 10:06 LiJiaqi96

BTW, did you evaluate the effectiveness of VideoChat2_HD and the newly added datasets separately? I'm curious whether the training scheme or the dataset matters more for the improvement. Thanks!

LiJiaqi96 avatar Jul 01 '24 07:07 LiJiaqi96

We did not conduct rigorous comparisons, since we wanted to make good use of the pretrained models.

And I think both are important based on some experiments:

  • Stage4: Directly fine-tuning VideoChat2-Stage3 with HD on the original Stage3 dataset only improved results marginally.
  • Stage3: Fine-tuning VideoChat2-Stage2 with the Stage4 dataset led to a performance drop of ~3%.

Andy1621 avatar Jul 01 '24 08:07 Andy1621

My experiment is consistent with your findings. I directly fine-tuned VideoChat2-Stage3 (trained by myself from Stage2, 3 epochs) with HD on the original Stage3 dataset (1 epoch), and the score on MVBench dropped from 56 to 43...

LiJiaqi96 avatar Jul 02 '24 01:07 LiJiaqi96

Interesting! I think HD needs more high-resolution, high-quality data.

Andy1621 avatar Jul 03 '24 01:07 Andy1621

These datasets are generated from ShareGPTVideo, VidLN, FAVD and TimeIT_didemo.

Hi, while downloading the datasets, I could not find "infovqa". Could you please help me locate that dataset?

LiJiaqi96 avatar Aug 14 '24 10:08 LiJiaqi96