[BUG]: OPT30B CUDA out of memory

Open iMountTai opened this issue 2 years ago • 3 comments

Script:

set -x
export BS=${BS:-1}
export MEMCAP=${MEMCAP:-40}
export MODEL=${MODEL:-"30b"}
export GPUNUM=${GPUNUM:-8}

mkdir -p ./logs

export MODEL_PATH="facebook/opt-${MODEL}"

torchrun \
  --nproc_per_node ${GPUNUM} \
  --master_port 19198 \
  train_gemini_opt.py \
  --mem_cap ${MEMCAP} \
  --model_name_or_path ${MODEL_PATH} \
  --batch_size ${BS}

Environment

torch: 1.13 (cu117); GPU: 8 × A100 48G; python: 3.9

iMountTai avatar Feb 21 '23 12:02 iMountTai

The GPU memory is insufficient. Please try a smaller model, such as opt-1.3b, and see whether it works.

JThh avatar Feb 21 '23 17:02 JThh
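As a rough check on the GPU memory claim above, the sketch below compares an estimated footprint for the model states against the aggregate GPU memory reported in the issue. The 16 bytes per parameter figure is an assumption (mixed-precision Adam: fp16 weights and gradients plus fp32 master weights, momentum, and variance), the parameter count is rounded to 30 billion, and activations are ignored. The helper name `estimate_gb` is hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate; every number here is an assumption, not a measurement.
# Assumed cost per parameter with mixed-precision Adam:
#   fp16 weights (2) + fp16 grads (2) + fp32 master weights, momentum, variance (4 + 4 + 4) = 16 bytes

# estimate_gb <params_in_billions> <bytes_per_param>
# 1e9 params * bytes / 1e9 bytes-per-GB simplifies to a plain product.
estimate_gb() {
  local params_billion=$1 bytes_per_param=$2
  echo $(( params_billion * bytes_per_param ))
}

PARAMS_B=30   # OPT-30B, rounded
GPUS=8        # from the issue's environment
GPU_GB=48     # per-GPU memory as reported in the issue

need=$(estimate_gb "$PARAMS_B" 16)
have=$(( GPUS * GPU_GB ))
echo "model states: ~${need} GB, aggregate GPU memory: ${have} GB"
```

Under these assumptions the model states alone (about 480 GB) exceed the 384 GB available across all eight GPUs, so part of the state would have to live in CPU memory.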

OPT-13B works, but for OPT-30B the log suggests the error occurs during model loading. [image]

iMountTai avatar Feb 22 '23 02:02 iMountTai

Bot detected that the issue body's language is not English and translated it automatically.

OPT-13B works, but for OPT-30B the log suggests the error occurs during model loading. [image]

Issues-translate-bot avatar Feb 22 '23 02:02 Issues-translate-bot

Hi @iMountTai, the most likely cause is insufficient CPU memory. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 20 '23 08:04 binmakeswell
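The CPU memory explanation can be sanity-checked before launching. The sketch below reads `MemAvailable` from `/proc/meminfo` (Linux only) and compares it with a rough requirement. It assumes, without confirmation from the thread, that each rank loads a full fp16 copy of the checkpoint into host memory before sharding. The function names `available_gib` and `required_gib` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare available host RAM with a rough requirement for model loading.
# Assumption (not confirmed by the thread): every rank holds a full fp16 copy
# of the checkpoint in CPU memory at load time.

# available_gib [meminfo_file]
# Prints MemAvailable in whole GiB; /proc/meminfo reports the value in KiB.
available_gib() {
  local meminfo=${1:-/proc/meminfo}
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "$meminfo"
}

# required_gib <params_in_billions> <num_ranks>
# 2 bytes per parameter (fp16), one full copy per rank, rounded down to GiB.
required_gib() {
  echo $(( $1 * 1000000000 * 2 * $2 / 1073741824 ))
}

need=$(required_gib 30 "${GPUNUM:-8}")
have=$(available_gib)
echo "host RAM needed: ~${need} GiB, available: ${have:-unknown} GiB"
```

If the available figure is well below the estimate, reducing `GPUNUM` or adding host memory are the obvious things to try. Whether the training script actually loads one copy per rank would need to be checked in its source.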