meareabc

Results: 2 issues by meareabc

When I use more than 2 GPUs (4 or 6), I get a tensor size error, but with 2 GPUs it works well. Is there some...

During training, after the model is saved for the first time, the training progress stops updating for a long time. This is my configuration:

    #!/bin/bash
    # Distributed training configuration
    MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
    MASTER_PORT=${MASTER_PORT:-$(shuf -i 20001-29999 -n 1)}
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    NPROC_PER_NODE=8
    NNODES=1
    # DeepSpeed configuration
    deepspeed=./scripts/zero3.json
    # Model configuration
    llm=./base_models/Qwen2.5-VL-7B-Instruct # Using HuggingFace model ID
    #...