[BUG]: llama2 hybrid_parallel or 3d plugin gives None loss when pp_size > 1
🐛 Describe the bug
Hi, I am trying to run the llama2 7B model on the yizhongw/self_instruct dataset. As the title suggests, training with the hybrid_parallel or 3d plugin gives None loss, but the other plugins work as expected without any issue.
tp_size=4 pp_size=2
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:x llama2/finetune.py --plugin "hybrid_parallel" --dataset "yizhongw/self_instruct" --model_path /workspace/ColossalAI/llama2-7b-hf/ --task_name "super_natural_instructions" --max_length 512 -e 10 -b 64 --lr 0.00002 --grad_checkpoint --mixed_precision bf16
Environment
Docker image: nvcr.io/nvidia/pytorch:23.12-py3
colossalai==0.3.6
Hi, could you try pulling the latest main branch? I have no trouble running with pp_size = 2.
Hi, unfortunately the latest main branch no longer has the finetune.py file. Also, benchmark.py doesn't support Hugging Face datasets and runs on a random dataset. I want to run it on the mentioned dataset with the llama2 model.
I think the booster should support any dataset. Have you tried replacing the random dataset with this? https://github.com/hpcaitech/ColossalAI/blob/8020f4263095373e4c7ad1b15e54b966a8ccb683/examples/language/llama2/finetune.py#L209
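For reference, a minimal sketch of that swap. The `prompt`/`completion` column names, the tokenization details, and the plain `DataLoader` below are illustrative assumptions, not the example's exact code; in finetune.py the dataset would be passed to the plugin's own dataloader helper so it gets sharded over the data-parallel group.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, default_data_collator

# Load the Hugging Face dataset from the report instead of the synthetic
# random dataset used in benchmark.py. The "prompt"/"completion" column
# names are assumed from the self_instruct dataset card.
raw = load_dataset("yizhongw/self_instruct", "super_natural_instructions", split="train")

tokenizer = AutoTokenizer.from_pretrained("/workspace/ColossalAI/llama2-7b-hf/")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizer has no pad token

def tokenize(sample):
    tokens = tokenizer(sample["prompt"] + sample["completion"],
                       max_length=512, truncation=True, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror inputs
    return tokens

dataset = raw.map(tokenize, remove_columns=raw.column_names)
dataset.set_format(type="torch")

# In finetune.py this dataset would go through plugin.prepare_dataloader(...);
# a plain DataLoader is used here only to keep the sketch self-contained.
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True,
                        collate_fn=default_data_collator)
```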
I did try replacing and modifying the code, but I still got None loss. Did you get a valid loss value?
Actually, with pipeline parallelism only the last stage computes the loss, so this is not a bug. You'll need to check whether the current rank is on the last pipeline stage to see the actual loss; see the sketch below.
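A minimal sketch of that check, assuming `booster`, `model`, `optimizer`, and `dataloader` already come from the example's setup with a HybridParallelPlugin; the criterion and logging logic are illustrative, not the example's exact code.

```python
def criterion(outputs, inputs):
    # HF causal-LM models return .loss when "labels" are present in the batch.
    return outputs.loss

use_pipeline = booster.plugin.pp_size > 1
is_pp_last_stage = use_pipeline and booster.plugin.stage_manager.is_last_stage()

dataloader_iter = iter(dataloader)
for step in range(len(dataloader)):
    if use_pipeline:
        # The booster pulls one batch from the iterator, splits it into
        # micro-batches, and runs the pipeline schedule; only the last
        # pipeline stage gets a non-None "loss" back.
        outputs = booster.execute_pipeline(dataloader_iter, model, criterion,
                                           optimizer, return_loss=True)
        loss = outputs["loss"]
    else:
        batch = {k: v.cuda() for k, v in next(dataloader_iter).items()}
        outputs = model(**batch)
        loss = criterion(outputs, batch)
        booster.backward(loss, optimizer)

    optimizer.step()
    optimizer.zero_grad()

    # Log only on ranks that actually hold a loss tensor; on every other
    # pipeline stage `loss` is None, which is what the report observed.
    if not use_pipeline or is_pp_last_stage:
        print(f"step {step}: loss = {loss.item():.4f}")
```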
Also, there's a llama fine-tuning example in applications/Colossal-LLaMA. Let me know if this addresses your question!
Thanks @Edenzzzz. That solves the issue I had.