InternVL2 video-data training: CUDA out of memory
I am running a full finetune of InternVL2 on video data with a single 32 GB V100, and training fails with torch.cuda.OutOfMemoryError: CUDA out of memory.
torchrun /cache/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py \
  --model_name_or_path /cache/MODELS/internvl2-4B \
  --conv_style "phi3-chat" \
  --output_dir /cache/InternVL/OUTPUTS/internvl_chat_v1_5_phi3_3_8b_dynamic_res_finetune_debug_load_2nd \
  --meta_path /cache/InternVL/internvl_chat/shell/data/internvl_1_2_finetune_7k.json \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 1 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone True \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 False \
  --fp16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-5 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed /cache/ZYM/InternVL/internvl_chat/zero_stage2_config.json \
  --report_to "tensorboard" \
  2>&1 | tee -a /cache/ZYM/InternVL/OUTPUTS/internvl_chat_v1_5_phi3_3_8b_dynamic_res_finetune_debug/training_log.txt
In LazySupervisedDataset I set the frame counts to the minimum:

class LazySupervisedDataset(Dataset):
    min_num_frame=1,  # for video data
    max_num_frame=1,  # for video data

Videos are loaded with the decord library: frames = read_frames_decord(fn, num_frames=max_num_frames, min_num_frames=min_num_frames, sample=sample, clip=clip)
batch_size is set to 1. Although video data does not seem to use dynamic high resolution, I still set max_dynamic_patch to 1, and the video dataloader loads a fixed 1 frame per video.
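For reference, the frame sampling above can be sketched roughly as follows. This is a minimal sketch, not InternVL's actual `read_frames_decord`; the even-spacing/random index logic and the `sample_indices` helper name are assumptions:

```python
import numpy as np

def sample_indices(total_frames, num_frames, sample="middle"):
    # Pick evenly spaced frame indices (or a random subset for sample="rand").
    if sample == "rand":
        chosen = np.random.choice(total_frames, size=min(num_frames, total_frames), replace=False)
        return sorted(chosen.tolist())
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()

def sample_frames(path, num_frames=1, sample="middle"):
    # decord decodes only the requested frames, so num_frames=1 keeps I/O cheap.
    from decord import VideoReader, cpu
    vr = VideoReader(path, ctx=cpu(0))
    idx = sample_indices(len(vr), num_frames, sample)
    return vr.get_batch(idx).asnumpy()  # (num_frames, H, W, 3) uint8
```

With min/max_num_frame both 1, each sample contributes only a single frame's worth of vision tokens, so the dataloader side is already close to minimal.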
Even with all of the above settings, GPU memory still overflows...
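A back-of-envelope estimate suggests why batch size and frame count barely matter here. The split below is approximate (InternVL2-4B is roughly 0.3B vision + 3.8B LLM parameters), and the accounting is the standard mixed-precision Adam layout (fp16 weights and gradients, fp32 master weights plus two Adam states), not a measurement of this exact run:

```python
def finetune_memory_gb(trainable_params, frozen_params):
    weights = (trainable_params + frozen_params) * 2  # fp16 copy of every weight on the GPU
    grads = trainable_params * 2                      # fp16 gradients
    optimizer = trainable_params * 12                 # fp32 master weights + Adam momentum + variance
    return (weights + grads + optimizer) / 1e9

# freeze_backbone=True still leaves the ~3.8B-param LLM trainable:
print(finetune_memory_gb(3.8e9, 0.3e9))  # roughly 61 GB before any activations
```

On a single GPU, ZeRO stage 2 cannot partition these states across devices, so the static footprint alone already exceeds 32 GB.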
If GPU memory overflows, you can set --freeze_llm to True; that should stop the OOM. Freezing the LLM means no gradients or Adam optimizer states are allocated for its ~3.8B parameters, which is where most of the memory goes in a full finetune.
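As a minimal sketch, what `--freeze_llm True` amounts to is the following; the `language_model` attribute name follows InternVL's chat-model layout, so treat it as an assumption:

```python
import torch.nn as nn

def freeze_llm(model):
    # Stop gradient tracking for the LLM submodule, so no gradients or
    # optimizer states are ever allocated for it.
    for p in model.language_model.parameters():
        p.requires_grad = False
    # Return how many parameters remain trainable (e.g. the MLP projector).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

With the LLM and vision backbone both frozen, only the small projector trains, and the remaining fp16 weights plus its optimizer states fit comfortably in 32 GB.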