[BUG]: training GPT2-S using a single card on colab, AssertionError: You should use `zero_ddp_wrapper` first
🐛 Describe the bug
When training GPT2-S on a single GPU on Colab with

!torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy colossalai_gemini_cpu --experience_batch_size 1 --train_batch_size 1

I hit the following error:

AssertionError: You should use `zero_ddp_wrapper` first
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21342) of binary: /usr/bin/python3

Environment
No response
Hi @LivinLuo1993 We did not test on Colab; a Linux server is the preferred choice. Sorry about that.
Hi @binmakeswell It does not seem to be a Colab problem: when I try to train PPO (stage 3), I get the same error. My environment is a Docker container using the huggingface/transformers-pytorch-gpu:4.23.0 image on an Ubuntu server. My command:
torchrun --standalone --nproc_per_node=1 train_prompts.py \
    --strategy colossalai_gemini \
    --prompt_path /home/data/instinwild_ch_1000.json \
    --pretrain_dataset /home/data/instinwild_ch_1000.json \
    --model "gpt2" \
    --pretrain "/home/saved_models_sft_demo" \
    --rm_model "gpt2" \
    --rm_path "/home/saved_models_rm_demo" \
    --save_path "/home/saved_models_actor_demo" \
    --num_episodes 10 \
    --max_timesteps 10 \
    --update_timesteps 10 \
    --max_epochs 5 \
    --train_batch_size 8 \
    --ptx_batch_size 4 \
    --experience_batch_size 8 \
    --lora_rank 0 \
    --kl_coef 0.1 \
    --ptx_coef 0.9
Then I got this error:
Traceback (most recent call last):
  File "train_prompts.py", line 236, in <module>
    main(args)
  File "train_prompts.py", line 170, in main
    (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
  File "/usr/local/lib/python3.8/dist-packages/coati/trainer/strategies/base.py", line 84, in prepare
    optimizer = self.setup_optimizer(optimizer, self._unwrap_model(model))
  File "/usr/local/lib/python3.8/dist-packages/coati/trainer/strategies/colossalai.py", line 147, in setup_optimizer
    return zero_optim_wrapper(model, optimizer, optim_config=self.zero_optim_config, **self.optim_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/wrapper.py", line 83, in zero_optim_wrapper
    assert hasattr(model, "_colo_zero_stage"), "You should use `zero_ddp_wrapper` first"
AssertionError: You should use `zero_ddp_wrapper` first
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3610) of binary: /usr/bin/python3
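For anyone debugging this before updating: the assertion at the bottom of the traceback only checks that the model was already wrapped (which sets the `_colo_zero_stage` attribute) before the optimizer wrapper runs. Below is a minimal illustrative sketch in plain Python of the ordering being enforced; the wrapper bodies are invented for illustration and are not ColossalAI's real implementation.

def zero_ddp_wrapper(model):
    # the real wrapper sets up Gemini/ZeRO sharding; here we only mark the model
    model._colo_zero_stage = 3
    return model

def zero_optim_wrapper(model, optimizer):
    # same check as colossalai/zero/wrapper.py line 83 in the traceback above
    assert hasattr(model, "_colo_zero_stage"), "You should use `zero_ddp_wrapper` first"
    return optimizer

class DummyModel:
    pass

class DummyOptim:
    pass

model, optim = DummyModel(), DummyOptim()
zero_optim_wrapper(zero_ddp_wrapper(model), optim)   # OK: model wrapped first
zero_optim_wrapper(DummyModel(), optim)              # raises the AssertionError shown above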
I met the same error. Has anyone solved it?
Using the latest code from the main branch solves the `zero_ddp_wrapper`-related error. Then, if you encounter an "assert isinstance(weight, ColoTensor)" error, I created pull request #3666 to solve it.