[BUG]: Running the OPT demo from examples on 3 A100 GPUs runs out of GPU memory
🐛 Describe the bug
The model is opt-66b, running on A100 GPUs with 80 GB of memory each.
The script used: bash ./run_gemini.sh 1 0 66b 3
When we run the script directly, all 3 GPUs climb to 80 GB of usage while the pretrained model is being loaded, and then CUDA out of memory is reported.
We then changed it to load the model into CPU memory instead, which used about 400 GB of RAM, and a different error was reported:
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.10_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 1.2893073558807373 seconds
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
[02/16/23 15:09:50] INFO colossalai - colossalai - INFO: /root/miniconda3/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py:217 step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
INFO colossalai - colossalai - INFO: train_gemini_opt.py:205 main
INFO colossalai - colossalai - INFO: step 0 finished, Tflops 25.13966826409652
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 505485 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 505487 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 505487 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 505486) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_gemini_opt.py FAILED
Environment
No response
I have validated this issue: in both cases (model loaded to CUDA memory and to CPU memory), there was an OOM. Could you please try a smaller model for now, such as opt-1.3b?
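For example, reusing the command from the report with a smaller model size (assuming the third positional argument of run_gemini.sh selects the model, as it does for 66b above):

bash ./run_gemini.sh 1 0 1.3b 3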
This problem is caused by an OOM during initialization: opt-66b should occupy around 130 GB of GPU memory, which far exceeds a single A100's 80 GB when each rank materializes a full copy of the model. I have added a shardinit option to the example script to fix this issue: https://github.com/nemoramo/ColossalAI/commit/7ff04147bcc0a96f2393b8b9657006e12078a70d
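For reference, a minimal sketch of what sharded initialization looks like with ColossalAI's ColoInitContext at that version, so that each rank only holds a slice of every parameter instead of the full model. The exact flag name and wiring live in the linked commit; the checkpoint name and variable names below are illustrative, not the commit's code.

import torch
import torch.distributed as dist
from transformers import AutoConfig, OPTForCausalLM
import colossalai
from colossalai.tensor import ProcessGroup, ShardSpec
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

colossalai.launch_from_torch(config={})
world_size = dist.get_world_size()

# Shard every parameter along its last dimension across all ranks at
# construction time, so no single GPU ever materializes the full model.
shard_pg = ProcessGroup(tp_degree=world_size)
default_dist_spec = ShardSpec([-1], [world_size])

config = AutoConfig.from_pretrained("facebook/opt-66b")
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_pg=shard_pg,
                     default_dist_spec=default_dist_spec):
    model = OPTForCausalLM(config)

Without the sharded spec (the pre-fix behavior), each of the 3 ranks builds the whole 66B-parameter model on its own device, which is why all GPUs fill to 80 GB during loading.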