How to speed up Llama 2 70B model loading
Describe the feature
I am following Colossal-LLaMA-2 to continue pretraining. I am using an 8x A100 80GB node with the Gemini plugin. Loading the model took more than an hour. Is there any way to speed this up? If I interrupt with Ctrl+C, it is stuck here:
```python
if args.load_checkpoint is None:
    coordinator.print_on_master(f"Load pretrained model checkpoint from {args.pretrained}")
    booster.load_model(model, args.pretrained, strict=False)
```
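For reference, here is a minimal sketch of building the model under ColossalAI's lazy-init context, which I understand avoids materializing the full fp32 model on every rank before Gemini shards it; the path matches my setup, but the surrounding plugin wiring is an assumption, not my exact script:

```python
# Sketch: lazy initialization with the Gemini plugin. Parameters created
# inside this context are not allocated until the plugin shards them,
# so each rank skips building a full unsharded copy of the 70B model.
from transformers import LlamaConfig, LlamaForCausalLM

from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device

pretrained = "/nas/lili/models_hf/70B-chat"  # local HF checkpoint directory
config = LlamaConfig.from_pretrained(pretrained)

with LazyInitContext(default_device=get_current_device()):
    model = LlamaForCausalLM(config)

# ... then boost with the Gemini plugin and load weights as before:
# model, optimizer, *_ = booster.boost(model, optimizer)
# booster.load_model(model, pretrained, strict=False)
```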
Hi, are you loading from a pretrained Hugging Face checkpoint? Is the download time included in this hour?
@Fridge003 Yes, I have downloaded the HF checkpoint to local disk, so there is no download time.
I am using 2 nodes, each with 8 A100 40GB GPUs. I find that Gemini first uses 41GB of CPU memory and 4-5GB of GPU memory per process (GPU), then gradually increases GPU usage. After loading, it uses about 25-30GB of GPU memory. After running for a long time, it uses 82GB of CPU memory (RES in top) and 38GB of GPU memory.
Thanks, we will check this issue.
Hi, I just tested the loading speed on an 8x A800 80GB node, and it took 75s to load a 7B model with the Gemini plugin, so the loading time for a 70B model should be around 15 minutes; it shouldn't exceed one hour. Could you please provide your environment information and the arguments in your script?
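If it helps to compare numbers, here is a minimal timing sketch around the existing load call in train.py; `booster`, `model`, `args`, and `coordinator` are assumed to be the objects the script already creates:

```python
# Wrap the existing load call so each run reports a comparable
# wall-clock number on the master rank.
import time

start = time.perf_counter()
booster.load_model(model, args.pretrained, strict=False)
coordinator.print_on_master(
    f"Loaded {args.pretrained} in {time.perf_counter() - start:.1f}s"
)
```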
Python 3.9
PyTorch 1.13.1+cu117
ColossalAI 0.3.3
Ubuntu 18.04 LTS
transformers 4.33.3

Start command:
```bash
colossalai run --nproc_per_node 8 --host xxx.xxx.xxx.3,xxx.xxx.xxx.4 \
    --master_addr xxx.xxx.xxx.3 --master_port 29500 train.py \
    --pretrained /nas/lili/models_hf/70B-chat \
    --dataset "/nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00000 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00001 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00002 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00003 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00004 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00005 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00006 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00007 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00008 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00009" \
    --plugin "gemini_auto" --save_interval 3000 --save_dir "/nas/pretrain/70b_rv" \
    --tensorboard_dir /nas/pretrain/tb70b --config_file config-70b.json \
    --num_epochs 1 --micro_batch_size 4 --lr 1e-4 --mixed_precision "bf16" \
    --grad_clip 1.0 --weight_decay 0.01 --use_grad_checkpoint --use_flash_attn --max_length 1024
```
Can a node with 4 * H100 80GB run Llama-2 70B full-parameter finetuning? Thanks!
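A rough back-of-envelope for this (my own estimate, assuming vanilla Adam with fp32 master weights and ignoring activation memory):

```python
# Back-of-envelope model-state memory for full-parameter Adam training.
# Assumptions: bf16 params + grads (2 + 2 bytes/param) and fp32 master
# weights + Adam momentum + variance (4 + 4 + 4 bytes/param);
# activations and temporary buffers are excluded.
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4           # = 16 bytes per parameter
model_states_gib = params * bytes_per_param / 1024**3
gpu_gib = 4 * 80                              # 4 x H100 80GB
print(f"model states: {model_states_gib:.0f} GiB vs {gpu_gib} GiB of GPU memory")
# -> roughly 1043 GiB vs 320 GiB
```

So the model states alone are around 1 TiB against 320GB of GPU memory, which suggests 4x H100 80GB only works with heavy CPU/NVMe offloading (e.g. Gemini with offload enabled), at a significant speed cost.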