cpu_offload with diffusers save_pretrained raises NotImplementedError: Cannot copy out of meta tensor; no data!
System Info
accelerate: 0.24.1
diffusers: 0.27.0
transformers: 4.30.2
The error:
Traceback (most recent call last):
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1248, in <module>
    main()
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1241, in main
    pipeline.save_pretrained(args.output_dir)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 279, in save_pretrained
    save_method(os.path.join(save_directory, pipeline_component_name), **save_kwargs)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 369, in save_pretrained
    safetensors.torch.save_file(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 284, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
    return {
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 492, in <dictcomp>
    "data": _tobytes(v, k),
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 414, in _tobytes
    tensor = tensor.to("cpu")
NotImplementedError: Cannot copy out of meta tensor; no data!
Steps: : 0it [00:01, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3110) of binary: /mnt/vdb/qingluo/env/bin/python
Traceback (most recent call last):
  File "/workspace/qingluo/env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
from accelerate import cpu_offload

text_encoder.to(accelerator.device, dtype=weight_dtype)
text_encoder = cpu_offload(text_encoder)
vae = cpu_offload(vae)
pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    text_encoder=text_encoder,
    vae=vae,
    unet=unet,
    revision=args.revision,
)
pipeline.save_pretrained(args.output_dir)
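To narrow down which component triggers the error, a small diagnostic like the one below could be run on `text_encoder` and `vae` right before `save_pretrained`. This is only a sketch: `find_meta_parameters` is a hypothetical helper, not part of accelerate or diffusers, and it assumes (not confirmed here) that `cpu_offload` in accelerate 0.24.1 leaves the module's parameters on the `meta` device. It relies only on `named_parameters()`, `named_buffers()`, and the `is_meta` attribute of PyTorch tensors.

```python
def find_meta_parameters(module):
    """Return the names of parameters and buffers that live on the meta device.

    `module` is expected to behave like a torch.nn.Module, i.e. to expose
    named_parameters() and named_buffers() yielding (name, tensor) pairs.
    Hypothetical helper for debugging; not an accelerate or diffusers API.
    """
    meta_names = []
    for name, param in module.named_parameters():
        # Meta tensors carry shape and dtype but no data, so they cannot
        # be copied to CPU for serialization.
        if param.is_meta:
            meta_names.append(name)
    for name, buf in module.named_buffers():
        if buf.is_meta:
            meta_names.append(name)
    return meta_names
```

If `find_meta_parameters(text_encoder)` or `find_meta_parameters(vae)` returns a non-empty list after the `cpu_offload` calls, that would be consistent with the traceback, where safetensors calls `tensor.to("cpu")` on a tensor that has no data.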
Expected behavior
Expected output: the model is saved successfully. With accelerate 0.20.2 and diffusers 0.20.0 this works, but after updating to the versions listed above it fails.
Hi @zengziru, could you share a full minimal reproducer?
> When I use the version accelerate 0.20.2, diffusers 0.20.0 it works. However, when I update the version, it failed
Does this happen after you update the accelerate library or the diffusers library?