Getting stuck at the DreamBooth training example
I am trying to reproduce the 'Dog toy example' in https://github.com/huggingface/diffusers/tree/main/examples/dreambooth, but training hangs forever at https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675. Do you know how to fix it?
I have followed the instructions for installing dependencies, and my GPU info is as follows:
```
Sat Dec 10 16:48:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    54W / 400W |  12801MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12929      C   /opt/conda/bin/python3.7        12799MiB |
+-----------------------------------------------------------------------------+
```
Hey @zihao12,
could you please add a reproducible code snippet? I cannot reproduce the error just from the text above.
Hi @patrickvonplaten, I followed the instructions at https://github.com/huggingface/diffusers/tree/main/examples/dreambooth up to the Dog Toy Example:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400
```
It gets stuck at epoch 0 for more than 30 minutes, at this line: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675
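In case it helps debugging, here is one way to see where the process is actually blocked, as a sketch using only the standard library (this is not part of the original training script, and the choice of `SIGUSR1` is arbitrary):

```python
import faulthandler
import signal
import tempfile

# Register a handler near the top of train_dreambooth.py, then send
# `kill -USR1 <pid>` to the hung process: every thread's Python stack
# is printed, showing exactly which call never returns.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Quick self-check: dump the current stacks to a temporary file.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    dump_path = f.name

print("stack dump written to", dump_path)
```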
Hi @zihao12! I've been trying to reproduce this issue but the settings you posted work fine for me. Would you mind providing a little more information so we can try to sort this out? If possible, please follow these steps:
1. Run `diffusers-cli env` and paste a copy of the output here. From your previous post I see you are using a single 40 GB A100 GPU, is that correct?
2. Please let us know the version of `accelerate` you have installed.
3. Also, if you can paste the contents of your default accelerate configuration, that would be helpful: `cat ~/.cache/huggingface/accelerate/default_config.yaml`
4. Does training stall at the beginning of the process, at the end, or somewhere in the middle? Does the progress bar show any progress at all?
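The version information requested above can also be collected in one short snippet. A sketch that degrades gracefully when a package is missing; it assumes Python 3.8+ for `importlib.metadata` (on 3.7 the `importlib_metadata` backport works the same way):

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Print the versions requested above in one pass.
for pkg in ("diffusers", "accelerate", "torch", "transformers"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

print("python:", platform.python_version())
print("platform:", platform.platform())
```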
Thanks a lot! I hope we can find and fix this issue with your help :)
Hi @pcuenca! Thanks for your reply!
- The training stalls at this line: `accelerator.backward(loss)` (https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675)
- I am using a single 40 GB A100 GPU.
- Output from `diffusers-cli env`:
- `diffusers` version: 0.10.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
- Output from `accelerate env`:
- `Accelerate` version: 0.13.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: fp16
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: no
- Output from `cat ~/.cache/huggingface/accelerate/default_config.yaml`:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
```
Thanks for your response! Your setup is very similar to mine, except that I have slightly newer versions of PyTorch (1.13 instead of 1.12.1) and accelerate (0.15.0 instead of 0.13.2). I'll do some more tests.
I'm running into the same issue, using a single 40 GB A100.
Output from `diffusers-cli env`:
- `diffusers` version: 0.10.2
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Output from `accelerate env`:
- `Accelerate` version: 0.15.0
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
Output from `cat ~/.cache/huggingface/accelerate/default_config.yaml`:

```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
Thanks @tzvc! I'll test on an A100 tomorrow.
@pcuenca I tested on a T4 afterwards with the same setup; no problem there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Sorry for the delay in getting back to you!
I could finally replicate this issue on A100 (40 GB) instances from Google Cloud, but not elsewhere. Specifically, I've found the problem when using pre-built images with PyTorch support, such as the Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3 M102. We haven't been able to identify the underlying problem yet. Until we do, my recommendation would be to use one of the following workarounds:
- Install one of the images without PyTorch, then install PyTorch manually using the process described here: https://pytorch.org/get-started/locally/.
- Create a virtual environment for dreambooth training and install PyTorch inside the environment.
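The second workaround can be sketched as follows. The environment path is arbitrary, and PyTorch itself still has to be installed inside the environment with the command for your CUDA version from https://pytorch.org/get-started/locally/:

```python
import venv
from pathlib import Path

# Create an isolated environment so training does not pick up the
# image's preinstalled PyTorch; the directory name is arbitrary.
env_dir = Path.home() / "dreambooth-env"
venv.create(env_dir, with_pip=True, clear=True)
print("interpreter:", env_dir / "bin" / "python")

# Then, in a shell:
#   source ~/dreambooth-env/bin/activate
#   pip install torch torchvision        # command from pytorch.org
#   pip install diffusers accelerate transformers
```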
Closing as this appears to be some sort of problem with GCP images.