[BUG]: Running the OPT demo from examples on 3 A100 GPUs runs out of GPU memory
🐛 Describe the bug
The model is opt-66b, running on A100 GPUs with 80 GB of memory each.
The script used: bash ./run_gemini.sh 1 0 66b 3
When we run the script directly, all 3 GPUs climb to 80 GB of usage while the pretrained model is being loaded, and then CUDA out of memory is reported.
We then changed it to load the model into CPU memory instead, which used about 400 GB of RAM, and a different error was reported:
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.10_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 1.2893073558807373 seconds
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: train_gemini_opt.py:205 main
INFO colossalai - colossalai - INFO: step 0 finished, Tflops 25.13966826409652
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 505485 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 505487 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 505487 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 505486) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_gemini_opt.py FAILED
Environment
No response
I have validated this issue: in both cases (model loaded to CUDA memory and to CPU memory), there was an OOM. Could you please try a smaller model for now, such as opt-1.3b?
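For example, reusing the command from the report with a smaller model size (assuming the third positional argument of run_gemini.sh selects the model, as it does for 66b above):

bash ./run_gemini.sh 1 0 1.3b 3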
This problem is caused by an OOM during initialization: opt-66b should occupy around 130 GB of GPU memory, which far exceeds a single A100's 80 GB when each rank materializes a full copy of the model. I have added a shardinit option to the example script to fix this issue: https://github.com/nemoramo/ColossalAI/commit/7ff04147bcc0a96f2393b8b9657006e12078a70d
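For reference, a minimal sketch of what sharded initialization looks like with ColossalAI's ColoInitContext at that version, so that each rank only holds a slice of every parameter instead of the full model. The exact flag name and wiring live in the linked commit; the checkpoint name and variable names below are illustrative, not the commit's code.

import torch
import torch.distributed as dist
from transformers import AutoConfig, OPTForCausalLM
import colossalai
from colossalai.tensor import ProcessGroup, ShardSpec
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

colossalai.launch_from_torch(config={})
world_size = dist.get_world_size()

# Shard every parameter along its last dimension across all ranks at
# construction time, so no single GPU ever materializes the full model.
shard_pg = ProcessGroup(tp_degree=world_size)
default_dist_spec = ShardSpec([-1], [world_size])

config = AutoConfig.from_pretrained("facebook/opt-66b")
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_pg=shard_pg,
                     default_dist_spec=default_dist_spec):
    model = OPTForCausalLM(config)

Without the sharded spec (the pre-fix behavior), each of the 3 ranks builds the whole 66B-parameter model on its own device, which is why all GPUs fill to 80 GB during loading.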