[BUG]: OOM for OPT 30B model using GeminiDDP
🐛 Describe the bug
Hello, I am using the OPT example code to try the 30B model on 8 A100 GPUs. I get OOM when the batch size is larger than 8 with GeminiDDP. I am not sure if this is normal, because I am able to use batch size 32 with ZeRO-offload without GeminiDDP. I set PLACEMENT_POLICY = 'auto' and get OOM errors; it works when I use the 'cpu' PLACEMENT_POLICY. I'd appreciate any ideas about it.
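For reference, this is roughly how the model is wrapped in my run (a minimal sketch modeled on the repo's GPT example; the exact import paths and constructor arguments such as `pin_memory` are my assumptions for this Colossal-AI version, and the `nn.Linear` is only a placeholder for the real OPT-30B module):

```python
import colossalai
import torch.nn as nn
from colossalai.nn.parallel import GeminiDDP
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

PLACEMENT_POLICY = 'auto'  # 'auto' OOMs for batch size > 8 here; 'cpu' works

colossalai.launch_from_torch(config={})

# placeholder module standing in for the OPT-30B model built by the example script
with ColoInitContext(device=get_current_device()):
    model = nn.Linear(1024, 1024)

# wrap the model with Gemini data parallelism using the chosen placement policy
model = GeminiDDP(
    model,
    device=get_current_device(),
    placement_policy=PLACEMENT_POLICY,
    pin_memory=True,
)
```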
Environment
Colossal-AI version: 0.1.13
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.3
CUDA Version required by PyTorch: 11.3
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: ✓
----------------------------
CUDA Extension: ✓
That's right, the 'cpu' policy is more stable than 'auto'. I suggest you use 'cpu' for large models. We will look into the problem in the 'auto' implementation. You can refer to the benchmark results in https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/README.md
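In practice that is a one-line change relative to the sketch above (again assuming the same GeminiDDP wrapping; the argument names are illustrative):

```python
# same wrapping as above, only the placement policy changes
model = GeminiDDP(
    model,
    device=get_current_device(),
    placement_policy='cpu',  # keep parameters in CPU memory; more stable than 'auto' for very large models
    pin_memory=True,
)
```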
Thanks @feifeibear for the quick reply. I'll check the benchmark results; please let me know if you have any updates. Many thanks!
@MikeChenfu I have updated the OPT example code and provided detailed benchmark results for the GPT example. I suppose OPT and GPT have similar performance. How much CPU memory did you use to train the 30B model on 8 GPUs?
Hello @feifeibear, thanks for the reply and the new code. I'll check it. For CPU memory on one node, I usually set 1.9 TB for the 30B model on 8 GPUs.
We have made a lot of updates since then. This issue is being closed due to inactivity. Thanks.