Getting stuck at the DreamBooth training example
I am trying to reproduce the 'Dog toy example' in https://github.com/huggingface/diffusers/tree/main/examples/dreambooth, but training hangs forever at https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675. Do you know how to fix it?
I have followed the instructions for installing dependencies, and my GPU info is as follows:
```
Sat Dec 10 16:48:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    54W / 400W |  12801MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12929      C   /opt/conda/bin/python3.7        12799MiB |
+-----------------------------------------------------------------------------+
```
Hey @zihao12,
could you please add a reproducible code snippet? I cannot reproduce the error just from the text above.
Hi @patrickvonplaten, I followed the instructions at https://github.com/huggingface/diffusers/tree/main/examples/dreambooth up to the Dog Toy Example:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400
```
It gets stuck at epoch 0 for more than 30 minutes, at this line: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675
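In case it helps debugging, here is one way to see where the process is actually blocked, as a sketch using only the standard library (this is not part of the original training script, and the choice of `SIGUSR1` is arbitrary):

```python
import faulthandler
import signal
import tempfile

# Register a handler near the top of train_dreambooth.py, then send
# `kill -USR1 <pid>` to the hung process: every thread's Python stack
# is printed, showing exactly which call never returns.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Quick self-check: dump the current stacks to a temporary file.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    dump_path = f.name

print("stack dump written to", dump_path)
```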
Hi @zihao12! I've been trying to reproduce this issue but the settings you posted work fine for me. Would you mind providing a little more information so we can try to sort this out? If possible, please follow these steps:
1. Run `diffusers-cli env` and paste a copy of the output here. From your previous post I see you are using a single 40 GB A100 GPU, is that correct?
2. Please let us know the version of `accelerate` you have installed.
3. Also, if you can paste the contents of your default accelerate configuration, that would be helpful: `cat ~/.cache/huggingface/accelerate/default_config.yaml`
4. Does training stall at the beginning of the process, at the end, or somewhere in the middle? Does the progress bar show any progress at all?
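The version information requested above can also be collected in one short snippet. A sketch that degrades gracefully when a package is missing; it assumes Python 3.8+ for `importlib.metadata` (on 3.7 the `importlib_metadata` backport works the same way):

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Print the versions requested above in one pass.
for pkg in ("diffusers", "accelerate", "torch", "transformers"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

print("python:", platform.python_version())
print("platform:", platform.platform())
```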
Thanks a lot! I hope we can find and fix this issue with your help :)
Hi @pcuenca! Thanks for your reply!
- The training stalls at this line: `accelerator.backward(loss)` (https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675)
- I am using a single 40 GB A100 GPU.
- Output from `diffusers-cli env`:
- `diffusers` version: 0.10.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
- Output from `accelerate env`:
- `Accelerate` version: 0.13.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: fp16
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: no
- Output from `cat ~/.cache/huggingface/accelerate/default_config.yaml`:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
```
Thanks for your response! Your setup is very similar to mine, except that I have slightly newer versions of PyTorch (1.13 instead of 1.12.1) and accelerate (0.15.0 instead of 0.13.2). I'll do some more tests.
I'm running into the same issue, using a single 40 GB A100.
Output from `diffusers-cli env`:
- `diffusers` version: 0.10.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Output from `accelerate env`:
- `Accelerate` version: 0.15.0
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
Output from `cat ~/.cache/huggingface/accelerate/default_config.yaml`:

```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
Thanks @tzvc! I'll test on an A100 tomorrow.
@pcuenca I tested on a T4 afterwards with the same setup; no problem there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Sorry for the delay in getting back to you!
I could finally replicate this issue on A100 (40 GB) instances from Google Cloud, but not elsewhere. Specifically, I've found the problem when using pre-built images with PyTorch support, such as the Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3 M102. We haven't been able to identify the underlying problem yet. Until we do, my recommendation would be to use one of the following workarounds:
- Install one of the images without PyTorch, then install PyTorch manually using the process described here: https://pytorch.org/get-started/locally/.
- Create a virtual environment for dreambooth training and install PyTorch inside the environment.
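The second workaround can be sketched as follows. The environment path is arbitrary, and PyTorch itself still has to be installed inside the environment with the command for your CUDA version from https://pytorch.org/get-started/locally/:

```python
import venv
from pathlib import Path

# Create an isolated environment so training does not pick up the
# image's preinstalled PyTorch; the directory name is arbitrary.
env_dir = Path.home() / "dreambooth-env"
venv.create(env_dir, with_pip=True, clear=True)
print("interpreter:", env_dir / "bin" / "python")

# Then, in a shell:
#   source ~/dreambooth-env/bin/activate
#   pip install torch torchvision        # command from pytorch.org
#   pip install diffusers accelerate transformers
```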
Closing as this appears to be some sort of problem with GCP images.