
[BUG]: Parameter `model.norm.weight` failed at the gradient reduction.

Open · wzh125 opened this issue on Mar 14, 2024 · 0 comments

🐛 Describe the bug

File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/tensor/colo_tensor.py", line 81, in torch_function File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/tensor/colo_tensor.py", line 81, in torch_function return backward_tensor.backward(**tensor_kwargs) File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward result = torch_func_method(public_api, types, args, kwargs) File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/tensor/colo_tensor.py", line 81, in torch_function return backward_tensor.backward(**tensor_kwargs)return backward_tensor.backward(**tensor_kwargs)

File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward return backward_tensor.backward(**tensor_kwargs) File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward torch.autograd.backward( File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward torch.autograd.backward(torch.autograd.backward(

File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 344, in grad_handle torch.autograd.backward( File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward passVariable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 344, in grad_handle File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 344, in grad_handle raise RuntimeError( Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward passRuntimeError : File "/usr/local/lib/miniconda3/envs/colo/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 344, in grad_handle Parameter model.norm.weight failed at the gradient reduction. Some unsupported torch function is operated upon this parameter. raise RuntimeError(raise RuntimeError(

RuntimeErrorRuntimeError: : Parameter model.norm.weight failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.Parameter model.norm.weight failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.
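For context, the error is raised from a per-parameter gradient hook (grad_handle in gemini_ddp.py): Gemini reduces gradients chunk by chunk as they are produced during backward, and the hook refuses to proceed when a gradient arrives in a state it cannot reduce. Below is a simplified, purely illustrative sketch of this general hook pattern; attach_reduction_guards and its validation rule are my own placeholders, not ColossalAI's actual implementation.

```python
import torch
import torch.nn as nn

def attach_reduction_guards(model: nn.Module):
    # Illustrative only: a per-parameter gradient hook in the spirit of
    # GeminiDDP's grad_handle. The real check in gemini_ddp.py differs;
    # this only shows where in backward such a RuntimeError can originate.
    def make_hook(name: str, param: nn.Parameter):
        def hook(grad: torch.Tensor):
            # A ZeRO-style engine expects each grad to match its parameter;
            # a mismatch suggests some torch function bypassed the wrapper
            # tensor's dispatch, so the reduction result would be wrong.
            if grad.shape != param.shape or grad.dtype != param.dtype:
                raise RuntimeError(
                    f"Parameter {name} failed at the gradient reduction. "
                    "Some unsupported torch function is operated upon this parameter."
                )
            return grad
        return hook

    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name, param))
```

In a normal backward pass these checks pass silently; the real Gemini hook instead hands the gradient to its chunk manager and raises when that step fails.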

The error occurs when I enable Gemini and use_flash_attn at the same time.
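For reference, a minimal sketch of the failing setup. This is hypothetical: TinyModel, the tensor shapes, and the training snippet stand in for my real script, and it assumes the colossalai Booster API with GeminiPlugin plus flash_attn 2.x on a GPU.

```python
import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from flash_attn import flash_attn_func  # flash_attn==2.0.5

class TinyModel(nn.Module):
    """Toy stand-in for my real model: one flash-attention layer plus a final norm."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.norm = nn.LayerNorm(dim)  # the `model.norm.weight` from the error

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        to_heads = lambda t: t.view(b, s, self.heads, d // self.heads)
        # flash_attn kernels require fp16/bf16 inputs
        out = flash_attn_func(to_heads(q).half(), to_heads(k).half(), to_heads(v).half())
        return self.norm(out.reshape(b, s, d).float())

colossalai.launch_from_torch(config={})
model = TinyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
booster = Booster(plugin=GeminiPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)

x = torch.randn(2, 16, 64, device="cuda")
loss = model(x).mean()
booster.backward(loss, optimizer)  # the RuntimeError is raised here, during grad reduction
optimizer.step()
```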

Environment

python=3.10.8, pytorch=2.0.0, cuda=11.8, flash_attn=2.0.5
