
[BUG]: training GPT2-S on a single card on Colab, AssertionError: You should use `zero_ddp_wrapper` first

Open LivinLuo1993 opened this issue 2 years ago • 3 comments

🐛 Describe the bug

When training GPT2-S on a single card on Colab with

!torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy colossalai_gemini_cpu --experience_batch_size 1 --train_batch_size 1

I hit this error:

AssertionError: You should use zero_ddp_wrapper first
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21342) of binary: /usr/bin/python3


Environment

No response

LivinLuo1993 avatar Apr 04 '23 03:04 LivinLuo1993

Hi @LivinLuo1993 We have not tested on Colab; a Linux server is the preferred environment. Sorry about that.

binmakeswell avatar Apr 07 '23 03:04 binmakeswell

Hi @binmakeswell It does not seem to be a Colab problem: when I try to train PPO (stage 3), I get the same error. My environment: a Docker container based on the huggingface/transformers-pytorch-gpu:4.23.0 image on an Ubuntu server. My command:

torchrun --standalone --nproc_per_node=1 train_prompts.py \
    --strategy colossalai_gemini \
    --prompt_path /home/data/instinwild_ch_1000.json \
    --pretrain_dataset /home/data/instinwild_ch_1000.json \
    --model "gpt2" \
    --pretrain "/home/saved_models_sft_demo" \
    --rm_model "gpt2" \
    --rm_path "/home/saved_models_rm_demo" \
    --save_path "/home/saved_models_actor_demo" \
    --num_episodes 10 \
    --max_timesteps 10 \
    --update_timesteps 10 \
    --max_epochs 5 \
    --train_batch_size 8 \
    --ptx_batch_size 4 \
    --experience_batch_size 8 \
    --lora_rank 0 \
    --kl_coef 0.1 \
    --ptx_coef 0.9

Then I got this error:

Traceback (most recent call last):
  File "train_prompts.py", line 236, in <module>
    main(args)
  File "train_prompts.py", line 170, in main
    (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
  File "/usr/local/lib/python3.8/dist-packages/coati/trainer/strategies/base.py", line 84, in prepare
    optimizer = self.setup_optimizer(optimizer, self._unwrap_model(model))
  File "/usr/local/lib/python3.8/dist-packages/coati/trainer/strategies/colossalai.py", line 147, in setup_optimizer
    return zero_optim_wrapper(model, optimizer, optim_config=self.zero_optim_config, **self.optim_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/wrapper.py", line 83, in zero_optim_wrapper
    assert hasattr(model, "_colo_zero_stage"), "You should use zero_ddp_wrapper first"
AssertionError: You should use zero_ddp_wrapper first
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3610) of binary: /usr/bin/python3
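For context, the assertion in the last frame means that zero_optim_wrapper requires the model to have already been wrapped on the model side, because the model wrapper is what attaches the `_colo_zero_stage` attribute the optimizer wrapper checks for. A minimal sketch of the intended call order is below; the import path and the model-wrapper name (zero_ddp_wrapper, taken from the error message itself) are assumptions inferred from the traceback and may differ in your ColossalAI version.

import torch
import torch.nn as nn

# Assumed import path, inferred from the traceback
# (.../colossalai/zero/wrapper.py); the exported names
# may differ across ColossalAI versions.
from colossalai.zero.wrapper import zero_ddp_wrapper, zero_optim_wrapper

model = nn.Linear(16, 16)  # placeholder model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 1) Wrap the model first. This step is what sets `_colo_zero_stage`
#    on the model, which zero_optim_wrapper asserts on.
model = zero_ddp_wrapper(model)

# 2) Only then wrap the optimizer, passing the already-wrapped model.
optimizer = zero_optim_wrapper(model, optimizer)

In the coati strategy shown above, both calls are made inside strategy.prepare(), so the error points to the strategy wrapping the optimizer against a model that was never wrapped, which is why upgrading the library (see below) rather than changing the training script is the suggested fix.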

shuxueslpi avatar Apr 08 '23 02:04 shuxueslpi

I met the same error. Has anyone solved it?

JerryYao80 avatar Apr 25 '23 08:04 JerryYao80

Using the latest code on the main branch solves the "zero_ddp_wrapper" related error. If you then encounter an "assert isinstance(weight, ColoTensor)" error, I created pull request #3666 to fix it.

zhang-yi-chi avatar Apr 28 '23 05:04 zhang-yi-chi