How to speed up Llama 2 70B model loading
Describe the feature
I am following Colossal-LLaMA-2 to continue pretraining. I am using an 8x A100 80GB node with the Gemini plugin. Loading the model took more than an hour. Is there any way to speed this up? If I interrupt with Ctrl+C, it is stuck here:
```python
if args.load_checkpoint is None:
    coordinator.print_on_master(f"Load pretrained model checkpoint from {args.pretrained}")
    booster.load_model(model, args.pretrained, strict=False)
```
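For reference, here is a minimal sketch of building the model under ColossalAI's lazy-init context, which I understand avoids materializing the full fp32 model on every rank before Gemini shards it; the path matches my setup, but the surrounding plugin wiring is an assumption, not my exact script:

```python
# Sketch: lazy initialization with the Gemini plugin. Parameters created
# inside this context are not allocated until the plugin shards them,
# so each rank skips building a full unsharded copy of the 70B model.
from transformers import LlamaConfig, LlamaForCausalLM

from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device

pretrained = "/nas/lili/models_hf/70B-chat"  # local HF checkpoint directory
config = LlamaConfig.from_pretrained(pretrained)

with LazyInitContext(default_device=get_current_device()):
    model = LlamaForCausalLM(config)

# ... then boost with the Gemini plugin and load weights as before:
# model, optimizer, *_ = booster.boost(model, optimizer)
# booster.load_model(model, pretrained, strict=False)
```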
Hi, are you loading from a pretrained Hugging Face checkpoint? Is the download time included in this hour?
@Fridge003 Yes, I have downloaded the HF checkpoint to local disk, so there is no download time.
I am using 2 nodes, each with 8 A100 40GB GPUs. I find that Gemini first uses 41GB of CPU memory and 4-5GB of GPU memory per process (GPU), then gradually increases GPU usage. After loading, it uses about 25-30GB of GPU memory. After running for a long time, it uses 82GB of CPU memory (RES in top) and 38GB of GPU memory.
Thanks, we will check this issue.
Hi, I just tested the loading speed on an 8x A800 80GB node, and it took 75s to load a 7B model with the Gemini plugin, so the loading time for a 70B model should be around 15 minutes; it shouldn't exceed one hour. Could you please provide your environment information and the arguments in your script?
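If it helps to compare numbers, here is a minimal timing sketch around the existing load call in train.py; `booster`, `model`, `args`, and `coordinator` are assumed to be the objects the script already creates:

```python
# Wrap the existing load call so each run reports a comparable
# wall-clock number on the master rank.
import time

start = time.perf_counter()
booster.load_model(model, args.pretrained, strict=False)
coordinator.print_on_master(
    f"Loaded {args.pretrained} in {time.perf_counter() - start:.1f}s"
)
```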
Python 3.9
PyTorch 1.13.1+cu117
ColossalAI 0.3.3
Ubuntu 18.04 LTS
transformers 4.33.3

Start command:
```bash
colossalai run --nproc_per_node 8 --host xxx.xxx.xxx.3,xxx.xxx.xxx.4 \
    --master_addr xxx.xxx.xxx.3 --master_port 29500 train.py \
    --pretrained /nas/lili/models_hf/70B-chat \
    --dataset "/nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00000 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00001 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00002 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00003 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00004 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00005 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00006 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00007 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00008 /nas/lili/colossalaitest/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00009" \
    --plugin "gemini_auto" --save_interval 3000 --save_dir "/nas/pretrain/70b_rv" \
    --tensorboard_dir /nas/pretrain/tb70b --config_file config-70b.json \
    --num_epochs 1 --micro_batch_size 4 --lr 1e-4 --mixed_precision "bf16" \
    --grad_clip 1.0 --weight_decay 0.01 --use_grad_checkpoint --use_flash_attn --max_length 1024
```
Can a node with 4 * H100 80GB run Llama-2 70B full-parameter finetuning? Thanks!
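A rough back-of-envelope for this (my own estimate, assuming vanilla Adam with fp32 master weights and ignoring activation memory):

```python
# Back-of-envelope model-state memory for full-parameter Adam training.
# Assumptions: bf16 params + grads (2 + 2 bytes/param) and fp32 master
# weights + Adam momentum + variance (4 + 4 + 4 bytes/param);
# activations and temporary buffers are excluded.
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4           # = 16 bytes per parameter
model_states_gib = params * bytes_per_param / 1024**3
gpu_gib = 4 * 80                              # 4 x H100 80GB
print(f"model states: {model_states_gib:.0f} GiB vs {gpu_gib} GiB of GPU memory")
# -> roughly 1043 GiB vs 320 GiB
```

So the model states alone are around 1 TiB against 320GB of GPU memory, which suggests 4x H100 80GB only works with heavy CPU/NVMe offloading (e.g. Gemini with offload enabled), at a significant speed cost.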