
Error occurred when fine-tuning MiniCPM-V-4_5

wzr0108 opened this issue on Sep 3, 2025 · 3 comments

I used finetune/finetune_lora.sh with the following contents:

#!/bin/bash

GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
 
MODEL="/mnt/241hdd/wzr/MiniCPM-V-CookBook/MiniCPM-V-4_5"
# or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2_6
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/mnt/241hdd/wzr/MiniCPM-V/finetune/single_image_training_data.json"
EVAL_DATA="/mnt/241hdd/wzr/MiniCPM-V/finetune/single_image_training_data.json"
# if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3",
# if use openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6, please set LLM_TYPE=qwen
LLM_TYPE="qwen"   
MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
export CUDA_VISIBLE_DEVICES=4,5,6
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# libcudart.so.11.0
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj)" \
    --model_max_length $MODEL_MAX_Length \
    --max_slice_nums 9 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output__lora \
    --logging_dir output/output_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --eval_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard" # wandb

And I encountered this error.

TypeError: MiniCPMV.__init__() got an unexpected keyword argument 'init_vision'

The full output is as follows.

W0903 19:23:39.858000 587341 torch/distributed/run.py:793] 
W0903 19:23:39.858000 587341 torch/distributed/run.py:793] *****************************************
W0903 19:23:39.858000 587341 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0903 19:23:39.858000 587341 torch/distributed/run.py:793] *****************************************
[2025-09-03 19:23:53,808] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-03 19:23:53,811] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-03 19:23:53,812] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-03 19:23:57,956] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-09-03 19:23:57,956] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-09-03 19:23:57,957] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-09-03 19:24:04,979] [INFO] [comm.py:821:init_distributed] cdb=None
[2025-09-03 19:24:04,979] [INFO] [comm.py:852:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-09-03 19:24:04,979] [INFO] [comm.py:821:init_distributed] cdb=None
[2025-09-03 19:24:04,980] [INFO] [comm.py:821:init_distributed] cdb=None
[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 302, in <module>
[rank2]:     train()
[rank2]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 200, in train
[rank2]:     model = AutoModel.from_pretrained(
[rank2]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 593, in from_pretrained
[rank2]:     return model_class.from_pretrained(
[rank2]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 316, in _wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4986, in from_pretrained
[rank2]:     model = cls(config, *model_args, **model_kwargs)
[rank2]: TypeError: MiniCPMV.__init__() got an unexpected keyword argument 'init_vision'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 302, in <module>
[rank1]:     train()
[rank1]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 200, in train
[rank1]:     model = AutoModel.from_pretrained(
[rank1]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 593, in from_pretrained
[rank1]:     return model_class.from_pretrained(
[rank1]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 316, in _wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4986, in from_pretrained
[rank1]:     model = cls(config, *model_args, **model_kwargs)
[rank1]: TypeError: MiniCPMV.__init__() got an unexpected keyword argument 'init_vision'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 302, in <module>
[rank0]:     train()
[rank0]:   File "/mnt/241hdd/wzr/MiniCPM-V/finetune/finetune.py", line 200, in train
[rank0]:     model = AutoModel.from_pretrained(
[rank0]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 593, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 316, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4986, in from_pretrained
[rank0]:     model = cls(config, *model_args, **model_kwargs)
[rank0]: TypeError: MiniCPMV.__init__() got an unexpected keyword argument 'init_vision'
[rank0]:[W903 19:24:06.172538963 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0903 19:24:08.066000 587341 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 587426 closing signal SIGTERM
E0903 19:24:08.131000 587341 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 587424) of binary: /mnt/241hdd/wzr/MiniCPM-V/.venv/bin/python3
Traceback (most recent call last):
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/241hdd/wzr/MiniCPM-V/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-09-03_19:24:08
  host      : arc-wzr89727-dhzykd-7948fc9b7f-6dqt9
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 587425)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-09-03_19:24:08
  host      : arc-wzr89727-dhzykd-7948fc9b7f-6dqt9
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 587424)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

wzr0108 · Sep 3, 2025

I had the same problem. In the end I had to comment out init_vision, init_audio, and init_tts in finetune.py to get it to run. I'm not sure how this will affect training.

model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    device_map=device_map,
    #init_vision=True,
    #init_audio=False,
    #init_tts=False,
)
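
If you prefer not to edit the call permanently, a minimal untested sketch is to attempt the MiniCPM-o style call first and fall back to a plain load when the model class rejects the init_* kwargs, which is exactly what MiniCPM-V-4_5 does here:

# Sketch only: try the MiniCPM-o style call, fall back if the loaded
# model class does not accept the init_* keyword arguments.
try:
    model = AutoModel.from_pretrained(
        model_args.model_name_or_path,
        trust_remote_code=True,
        torch_dtype=compute_dtype,
        device_map=device_map,
        init_vision=True,
        init_audio=False,
        init_tts=False,
    )
except TypeError:
    # MiniCPM-V checkpoints raise "unexpected keyword argument 'init_vision'"
    model = AutoModel.from_pretrained(
        model_args.model_name_or_path,
        trust_remote_code=True,
        torch_dtype=compute_dtype,
        device_map=device_map,
    )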

adeobootpin · Sep 5, 2025

@qyc-98 PTAL

tc-mb · Sep 5, 2025

By default, the fine-tuning code is set up to load MiniCPM-o and passes init_vision, init_audio, and init_tts when loading the model. If you want to fine-tune MiniCPM-V instead, you can comment out those init_* arguments.
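
For reference, a hedged sketch (not from the official repo) that keeps one script working for both families is to gate the extra kwargs on the checkpoint's config; the "MiniCPMO" architecture name used in the check is an assumption about the checkpoint's config.json and may need to be adjusted locally:

from transformers import AutoConfig, AutoModel

# model_args, compute_dtype and device_map come from the surrounding finetune.py.
# Gate the MiniCPM-o-specific kwargs on the checkpoint's reported architecture;
# "MiniCPMO" is an assumed name, check your checkpoint's config.json.
config = AutoConfig.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=True
)
is_minicpm_o = any("MiniCPMO" in arch for arch in (config.architectures or []))

extra_kwargs = (
    dict(init_vision=True, init_audio=False, init_tts=False) if is_minicpm_o else {}
)

model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    device_map=device_map,
    **extra_kwargs,
)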

qyc-98 · Sep 5, 2025