accelerator.prepare() hits CUDA OOM, but the same model works on a single GPU
System Info
- `Accelerate` version: 1.0.1
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 2015.00 GB
- GPU type: NVIDIA A800-SXM4-40GB
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
```python
import accelerate
from accelerate import DistributedDataParallelKwargs
from transformers import GPT2Model
from peft import LoraConfig, get_peft_model

# ddp_kwargs is used below but was not defined in the original snippet;
# something like this is presumably intended:
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)

accelerator = accelerate.Accelerator(kwargs_handlers=[ddp_kwargs])

# args comes from the script's argparse namespace (not shown)
model = GPT2Model.from_pretrained(args.model_dir, output_hidden_states=True)
if args.pretrain == 1 and args.freeze == 1:
    peft_config = LoraConfig(
        r=128,
        lora_alpha=256,
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
model = accelerator.prepare(model)
```
Expected behavior
Here is the traceback:
```
Traceback (most recent call last):
  File "/workspace/Graph-Network/main.py", line 174, in <module>
    model = accelerator.prepare(model)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
It's confusing that CUDA raises OOM, yet unlike a typical OOM error it does not even report trying to allocate any GPU memory. In fact, my GPUs are completely free according to nvidia-smi.
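A quick sanity check here (a sketch of my own, not an official diagnostic) is to print what the failing process can actually see right before calling `prepare()`; an OOM on an apparently empty GPU often means the process is mapped to different devices than `nvidia-smi` shows:

```python
import os
import torch

# What does this process think is visible? A mismatch with nvidia-smi
# usually points at CUDA_VISIBLE_DEVICES being changed or ignored.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```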
Thanks for reporting. Could you please:
- Share the output of `accelerate env`
- Tell us how you run the script
- Tell us what PEFT version you're using
- What is the model in `args.model_dir`?
- If you comment out `model = get_peft_model(model, peft_config)`, do you get the same error? (A minimal version of this check is sketched below.)
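For the last point, a minimal ablation might look like this ("gpt2" is a placeholder, since the actual checkpoint behind `args.model_dir` wasn't shared):

```python
import accelerate
from transformers import GPT2Model

# Same prepare() call with PEFT removed entirely. If this still OOMs,
# the problem is in device placement, not in get_peft_model().
accelerator = accelerate.Accelerator()
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model = accelerator.prepare(model)
print("prepared on", accelerator.device)
```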
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any solution?
There seems to be a bug in transformers: importing it can overwrite the CUDA_VISIBLE_DEVICES value I set myself. I changed my training command from `accelerate launch CUDA_VISIBLE_DEVICES=0,1,2 ...` to `export CUDA_VISIBLE_DEVICES=0,1,2` followed by `accelerate launch ...`, and that solved my issue. This may not apply to your situation, but it may serve as a reference. Hope it helps.
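For what it's worth, a shell assignment written after the command name (`accelerate launch CUDA_VISIBLE_DEVICES=0,1,2`) is passed to the program as an argument rather than exported, which by itself could explain the difference. The same pinning can also be done in Python, as in this sketch (the device ids are just examples):

```python
import os

# Pin the visible devices before any import that might initialize CUDA,
# so nothing can observe (or overwrite the effect of) a different mapping.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # example ids

import torch         # noqa: E402 -- imported after the env var on purpose
import transformers  # noqa: E402

print(torch.cuda.device_count())  # should report 3 with the example value
```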