[BUG]: llama2 hybrid_parallel or 3d plugin gives None loss when pp_size > 1
🐛 Describe the bug
Hi, I am trying to run the llama2 7B model on the yizhongw/self_instruct dataset. As the title suggests, training with the hybrid_parallel or 3d plugin gives None loss, but the other plugins work as expected without any issue.
tp_size=4 pp_size=2
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:x llama2/finetune.py --plugin "hybrid_parallel" --dataset "yizhongw/self_instruct" --model_path /workspace/ColossalAI/llama2-7b-hf/ --task_name "super_natural_instructions" --max_length 512 -e 10 -b 64 --lr 0.00002 --grad_checkpoint --mixed_precision bf16
Environment
Docker image: nvcr.io/nvidia/pytorch:23.12-py3
colossalai==0.3.6
Hi, could you try pulling the latest main branch? I have no trouble running with pp_size = 2.
Hi, unfortunately the latest main branch no longer has the finetune.py file. Also, benchmark.py doesn't support Hugging Face datasets and runs on a random dataset. I want to run it on the mentioned dataset with the llama2 model.
I think the booster should support any dataset. Have you tried replacing the random dataset with this? https://github.com/hpcaitech/ColossalAI/blob/8020f4263095373e4c7ad1b15e54b966a8ccb683/examples/language/llama2/finetune.py#L209
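For reference, a minimal sketch of that swap. The `prompt`/`completion` column names, the tokenization details, and the plain `DataLoader` below are illustrative assumptions, not the example's exact code; in finetune.py the dataset would be passed to the plugin's own dataloader helper so it gets sharded over the data-parallel group.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, default_data_collator

# Load the Hugging Face dataset from the report instead of the synthetic
# random dataset used in benchmark.py. The "prompt"/"completion" column
# names are assumed from the self_instruct dataset card.
raw = load_dataset("yizhongw/self_instruct", "super_natural_instructions", split="train")

tokenizer = AutoTokenizer.from_pretrained("/workspace/ColossalAI/llama2-7b-hf/")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizer has no pad token

def tokenize(sample):
    tokens = tokenizer(sample["prompt"] + sample["completion"],
                       max_length=512, truncation=True, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror inputs
    return tokens

dataset = raw.map(tokenize, remove_columns=raw.column_names)
dataset.set_format(type="torch")

# In finetune.py this dataset would go through plugin.prepare_dataloader(...);
# a plain DataLoader is used here only to keep the sketch self-contained.
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True,
                        collate_fn=default_data_collator)
```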
I did try replacing and modifying the code, but I still got None loss. Did you get a valid loss value?
Actually, with pipeline parallelism only the last stage computes the loss, so this is not a bug. You'll need to check whether the current rank is on the last pipeline stage to see the actual loss; see the sketch below.
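A minimal sketch of that check, assuming `booster`, `model`, `optimizer`, and `dataloader` already come from the example's setup with a HybridParallelPlugin; the criterion and logging logic are illustrative, not the example's exact code.

```python
def criterion(outputs, inputs):
    # HF causal-LM models return .loss when "labels" are present in the batch.
    return outputs.loss

use_pipeline = booster.plugin.pp_size > 1
is_pp_last_stage = use_pipeline and booster.plugin.stage_manager.is_last_stage()

dataloader_iter = iter(dataloader)
for step in range(len(dataloader)):
    if use_pipeline:
        # The booster pulls one batch from the iterator, splits it into
        # micro-batches, and runs the pipeline schedule; only the last
        # pipeline stage gets a non-None "loss" back.
        outputs = booster.execute_pipeline(dataloader_iter, model, criterion,
                                           optimizer, return_loss=True)
        loss = outputs["loss"]
    else:
        batch = {k: v.cuda() for k, v in next(dataloader_iter).items()}
        outputs = model(**batch)
        loss = criterion(outputs, batch)
        booster.backward(loss, optimizer)

    optimizer.step()
    optimizer.zero_grad()

    # Log only on ranks that actually hold a loss tensor; on every other
    # pipeline stage `loss` is None, which is what the report observed.
    if not use_pipeline or is_pp_last_stage:
        print(f"step {step}: loss = {loss.item():.4f}")
```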
Also, there's a llama fine-tuning example in applications/Colossal-LLaMA. Let me know if this addresses your question!
Thanks @Edenzzzz. That solves the issue I had.