train_text_to_image_sdxl.py Can't save model at checkpoint
Describe the bug
I am trying to fine-tune SDXL, but the training script crashes when saving the model at a checkpoint. Training itself runs fine until the first checkpoint is reached.
Reproduction
Here are my accelerate config choices:
- This machine
- No distributed training
- No
- No
- yes (to use deepspeed)
- no (don't specify a json)
- 2 (deepspeed's zero optimization stage 2)
- cpu (to offload optimizer states on the cpu)
- none (don't offload parameters)
- 4
- no
- no
- 1
- fp16
Then I run the following, taken from the example in examples/text_to_image/README_sdxl.md:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
Here I only modified checkpointing_steps to make the error happen sooner:
accelerate launch train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=10000 \
--use_8bit_adam \
--learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
--checkpointing_steps=5 \
--output_dir="sdxl-pokemon-model"
Logs
... (training starts) ...
Steps: 0%| | 4/10000 [00:36<21:14:27, 7.65s/it, lr=1e-6, step_loss=0.0118][2024-03-14 02:46:19,909] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
Steps: 0%| | 5/10000 [00:38<21:02:03, 7.58s/it, lr=1e-6, step_loss=0.0118]03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving current state to sdxl-pokemon-model/checkpoint-5
03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2024-03-14 02:46:19,913] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2024-03-14 02:46:19,937] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt
[2024-03-14 02:46:19,937] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt...
[2024-03-14 02:46:36,762] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt.
[2024-03-14 02:46:36,766] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-14 02:47:03,094] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-14 02:47:03,095] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-14 02:47:03,095] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
03/14/2024 02:47:03 - INFO - accelerate.accelerator - DeepSpeed Model and Optimizer saved to output dir sdxl-pokemon-model/checkpoint-5/pytorch_model
Configuration saved in sdxl-pokemon-model/checkpoint-5/unet/config.json
Traceback (most recent call last):
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1312, in <module>
main(args)
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1169, in main
accelerator.save_state(save_path)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2706, in save_state
hook(self._models, weights, output_dir)
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 731, in save_model_hook
model.save_pretrained(os.path.join(output_dir, "unet"))
File "/root/diffusers/src/diffusers/models/modeling_utils.py", line 369, in save_pretrained
safetensors.torch.save_file(
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 232, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 394, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'down_blocks.2.attentions.0.transformer_blocks.9.norm1.weight', 'up_blocks.0.attentions.0.transformer_blocks.5.attn2.to_out.0.bias', 'up_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.weight',
...... (lots of layers) .....
'up_blocks.0.attentions.2.transformer_blocks.8.norm1.bias', 'up_blocks.0.attentions.0.transformer_blocks.1.attn2.to_out.0.bias'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
[2024-03-14 02:47:08,088] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 10510) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1002, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-14_02:47:08
host : 4e28de93c858
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10510)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
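The RuntimeError comes from safetensors refusing to serialize state dicts in which several tensors share the same storage. That is apparently what the DeepSpeed-wrapped UNet produces here, likely because DeepSpeed flattens parameters into contiguous buffers so that many parameters become views of one storage. A minimal standalone illustration of the restriction (not code from the training script):

# Standalone illustration of the safetensors restriction behind the error above.
import torch.nn as nn
from safetensors.torch import save_file, save_model

class Tied(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)
        self.b = nn.Linear(4, 4)
        self.b.weight = self.a.weight  # two state_dict entries now share one storage

m = Tied()

try:
    # save_file rejects state dicts whose tensors share memory ...
    save_file(m.state_dict(), "tied.safetensors")
except RuntimeError as err:
    print("save_file failed:", err)

# ... while save_model deduplicates shared tensors before writing,
# which is what the error message suggests using.
save_model(m, "tied.safetensors")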
System Info
- diffusers version: 0.27.0.dev0
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Huggingface_hub version: 0.21.4
- Transformers version: 4.36.2
- Accelerate version: 0.25.0
- xFormers version: 0.0.24
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
I have an RTX4090.
Who can help?
@sayakpaul
Does it happen without DeepSpeed? I am sadly not well-versed in DeepSpeed, so I cannot help much.
https://github.com/huggingface/diffusers/pull/6628/files should fix the problem I think.
@sayakpaul I need DeepSpeed; otherwise training won't start (NVIDIA out-of-memory error).
I have edited my comment. See if that helps.
I passed the config file to the command as follows:
accelerate launch --config_file $ACCELERATE_CONFIG_FILE train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=10000 \
--use_8bit_adam \
--learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
--checkpointing_steps=5 \
--output_dir="sdxl-pokemon-model"
but the same problem happens.
This was the config.yaml file I had
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
How about applying changes from https://github.com/huggingface/diffusers/pull/6628/? More specifically, the changes introduced in examples/text_to_image/train_text_to_image_lora_sdxl.py?
@sayakpaul Trying
if isinstance(unwrap_model(model), type(unwrap_model(unet))):
    model.save_pretrained(os.path.join(output_dir, "unet"))
in the code didn't change the error
How about:
if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+     unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
?
@sayakpaul The same error happens, even with
if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+     unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
I have also tried running train_text_to_image_lora_sdxl.py to see if it worked, and got the same error as with train_text_to_image_sdxl.py.
Deactivating DeepSpeed makes train_text_to_image_lora_sdxl.py work fine.
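For reference, with both suggestions applied the save hook would look roughly like the sketch below. The surrounding loop and the weights.pop() handling are paraphrased from the example script and may differ slightly; accelerator, unet, unwrap_model, and os come from the script's scope.

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        for model in models:
            # compare and save the unwrapped model so the accelerate/DeepSpeed
            # wrapper is not handed to safetensors directly
            if isinstance(unwrap_model(model), type(unwrap_model(unet))):
                unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
            # pop the weight so accelerate does not save the same model twice
            if weights:
                weights.pop()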
Cc: @HelloWorldBeginner. Could you help here if you have any pointers?
I haven't used CPU offload in DeepSpeed, but ZeRO-2 works fine for me on 8x A100s.
@clement-swk When using DeepSpeed, you can install Apex, which DeepSpeed will use automatically. That works for me.
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
@AoqunJin Thanks for your reply! I tried installing apex but the problem remains.
@clement-swk
You can also try removing the accelerator.is_main_process checks. Saving only from the main process means it cannot gather the states held on the other devices; removing the guard avoids that.
In
def save_model_hook(models, weights, output_dir):
    # if accelerator.is_main_process:
And
train_loss = 0.0
# if accelerator.is_main_process:
if global_step % args.checkpointing_steps == 0:
At train_text_to_image_lora_sdxl.py.
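For illustration, the checkpointing block without the guard would look roughly like this (paraphrased from the example script; accelerator, args, global_step, logger, and os come from the script's scope). The point is that accelerator.save_state() then runs on every rank, which DeepSpeed needs in order to write its partitioned optimizer states:

# paraphrased sketch, not the exact code from the script
if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(save_path)
    logger.info(f"Saved state to {save_path}")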
@AoqunJin I tried that and the same error appeared.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Will give this a look.
I proposed a couple of fixes here: https://github.com/huggingface/accelerate/issues/2787. Does this help?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.