[BUG]: OOM for OPT 30B model using GeminiDDP
🐛 Describe the bug
Hello, I am using the OPT example code to try the 30B model on 8 A100 GPUs. I get OOM when the batch size is larger than 8 with GeminiDDP. I am not sure if this is normal, because I am able to use batch size 32 with ZeRO-offload without GeminiDDP. I set PLACEMENT_POLICY = 'auto' and get OOM errors; it works when I use the 'cpu' PLACEMENT_POLICY. I'd appreciate any ideas about it.
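For reference, this is roughly how the model is wrapped in my run (a minimal sketch modeled on the repo's GPT example; the exact import paths and constructor arguments such as `pin_memory` are my assumptions for this Colossal-AI version, and the `nn.Linear` is only a placeholder for the real OPT-30B module):

```python
import colossalai
import torch.nn as nn
from colossalai.nn.parallel import GeminiDDP
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

PLACEMENT_POLICY = 'auto'  # 'auto' OOMs for batch size > 8 here; 'cpu' works

colossalai.launch_from_torch(config={})

# placeholder module standing in for the OPT-30B model built by the example script
with ColoInitContext(device=get_current_device()):
    model = nn.Linear(1024, 1024)

# wrap the model with Gemini data parallelism using the chosen placement policy
model = GeminiDDP(
    model,
    device=get_current_device(),
    placement_policy=PLACEMENT_POLICY,
    pin_memory=True,
)
```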
Environment
Colossal-AI version: 0.1.13
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.3
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: ✓
----------------------------
CUDA Extension: ✓
That's right, the 'cpu' policy is more stable than 'auto'. I suggest you use 'cpu' for large models. We will look into the problem in the 'auto' implementation. You can refer to the benchmark results in https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/README.md
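In practice that is a one-line change relative to the sketch above (again assuming the same GeminiDDP wrapping; the argument names are illustrative):

```python
# same wrapping as above, only the placement policy changes
model = GeminiDDP(
    model,
    device=get_current_device(),
    placement_policy='cpu',  # keep parameters in CPU memory; more stable than 'auto' for very large models
    pin_memory=True,
)
```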
Thanks @feifeibear for the quick reply. I'll check the benchmark results; please let me know if you have any updates. Many thanks!
@MikeChenfu I have updated the OPT example code and provided detailed benchmark results for the GPT example. I suppose OPT and GPT have similar performance. How much CPU memory did you use to train the 30B model on 8 GPUs?
Hello @feifeibear, thanks for the reply and the new code. I'll check it. For CPU memory on one node, I usually set 1.9 TB for the 30B model on 8 GPUs.
We have made a lot of updates since then. This issue is being closed due to inactivity. Thanks.