
DeepSpeed Error: return code -11 and status 245, RTX 3080 10GB

Open toomanydev opened this issue 3 years ago • 2 comments

Describe the bug

I'm trying to train with the text encoder enabled to improve results on faces, but I'm getting some very uninformative errors. They are not OOM errors, or at least nothing says they are. A similar but distinct issue (https://github.com/ShivamShrirao/diffusers/issues/112) likewise shows the Python side reporting only a subprocess error with no other information.

Attempts without the text encoder, and with or without 8-bit Adam, also fail with the same error.

Training works as expected without the text encoder and without DeepSpeed (with the text encoder it OOMs; without it, it works).

I will try a git pull and training on Stable Diffusion 1.4 instead of 1.5, but both require lengthy downloads, so I will update once they are done. Even if the git pull fixes the issue, it will still be good to have this bug documented for whoever encounters it later, as Google turns up nothing. Update: unfortunately, neither fixed nor changed the bug.
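
For context: return code -11 from the DeepSpeed launcher means the training subprocess died with signal 11 (SIGSEGV), and exit status 245 is the same value reported as an unsigned byte (256 - 11), so no Python traceback from the training script itself is expected. One way to see whether the WSL2 kernel segfaulted or OOM-killed the process is to check the kernel log right after a failed run, e.g.:

# Look for a segfault or an out-of-memory kill in the WSL2 kernel log
sudo dmesg | grep -iE "segfault|killed process|out of memory" | tail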

Reproduction

My accelerate config:

(diffusers) warm@DESKTOP-G2EREOM:~/github/diffusers/examples/dreambooth$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 2
Where to offload optimizer states? [none/cpu/nvme]: cpu
Where to offload parameters? [none/cpu/nvme]: cpu
How many gradient accumulation steps you're passing in your script? [1]: 1
Do you want to use gradient clipping? [yes/NO]: no
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:1
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: fp16
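
For reference, those answers end up in accelerate's default config file (typically ~/.cache/huggingface/accelerate/default_config.yaml). A sketch of roughly what it should contain is below; the exact keys vary between accelerate versions, so treat this as an assumption rather than a dump of the actual file:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false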

My training script:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="training/test"
export CLASS_DIR="classes/test"
export OUTPUT_DIR="output/test_te_3.0e-7"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="photo of sks woman" \
  --class_prompt="photo of woman" \
  --save_sample_prompt="masterpiece photo of sks woman, 4k, dlsr, canon, nikon" \
  --save_sample_negative_prompt="lowres, low quality, blurry, jpeg artifacts, text, error" \
  --seed=5555555 \
  --save_interval=500 \
  --n_save_sample=9 \
  --resolution=512 \
  --center_crop \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --train_text_encoder \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=3.0e-7 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --sample_batch_size=1 \
  --max_train_steps=4000

Logs

./train_test_te.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-11-17 15:28:49,093] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-11-17 15:28:49,145] [INFO] [runner.py:508:main] cmd = /home/warm/anaconda3/envs/diffusers/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank train_dreambooth.py --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 --instance_data_dir=training/test --class_data_dir=classes/test --output_dir=output/test_te_3.0e-7 --instance_prompt=photo of sks woman --class_prompt=photo of woman --save_sample_prompt=masterpiece photo of sks woman, 4k, dlsr, canon, nikon --save_sample_negative_prompt=lowres, low quality, blurry, jpeg artifacts, text, error --seed=5555555 --save_interval=500 --n_save_sample=9 --resolution=512 --center_crop --train_batch_size=1 --mixed_precision=fp16 --train_text_encoder --gradient_accumulation_steps=1 --gradient_checkpointing --learning_rate=3.0e-7 --lr_scheduler=constant --lr_warmup_steps=0 --num_class_images=100 --sample_batch_size=1 --max_train_steps=4000
[2022-11-17 15:28:50,292] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2022-11-17 15:28:50,292] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-11-17 15:28:50,292] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-11-17 15:28:50,292] [INFO] [launch.py:162:main] dist_world_size=1
[2022-11-17 15:28:50,292] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2022-11-17 15:28:52,237] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Caching latents: 100%|████████████████████████████████████████████████████████████████| 112/112 [00:09<00:00, 11.69it/s]
[2022-11-17 15:29:15,800] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.5, git-hash=unknown, git-branch=unknown
/home/warm/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  warnings.warn(
[2022-11-17 15:29:15,921] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-11-17 15:29:15,922] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-11-17 15:29:15,922] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-11-17 15:29:15,936] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2022-11-17 15:29:15,936] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-11-17 15:29:15,936] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-11-17 15:29:15,936] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 500000000
[2022-11-17 15:29:15,936] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 500000000
[2022-11-17 15:29:15,936] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2022-11-17 15:29:15,936] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /home/warm/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Emitting ninja build file /home/warm/.cache/torch_extensions/py39_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.14428973197937012 seconds
Rank: 0 partition count [1] and sizes[(982581444, False)]
[2022-11-17 15:29:18,622] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-11-17 15:29:18,622] [INFO] [utils.py:828:see_memory_usage] MA 3.82 GB         Max_MA 4.83 GB         CA 4.08 GB         Max_CA 5 GB
[2022-11-17 15:29:18,622] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 9.21 GB, percent = 59.0%
[2022-11-17 15:29:20,326] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 11943
[2022-11-17 15:29:20,327] [ERROR] [launch.py:324:sigkill_handler] ['/home/warm/anaconda3/envs/diffusers/bin/python', '-u', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/test', '--class_data_dir=classes/test', '--output_dir=output/test_te_3.0e-7', '--instance_prompt=photo of sks woman', '--class_prompt=photo of woman', '--save_sample_prompt=masterpiece photo of sks woman, 4k, dlsr, canon, nikon', '--save_sample_negative_prompt=lowres, low quality, blurry, jpeg artifacts, text, error', '--seed=5555555', '--save_interval=500', '--n_save_sample=9', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--train_text_encoder', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=3.0e-7', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=1', '--max_train_steps=4000'] exits with return code = -11
Traceback (most recent call last):
  File "/home/warm/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/warm/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/warm/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 827, in launch_command
    deepspeed_launcher(args)
  File "/home/warm/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 540, in deepspeed_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--num_gpus', '1', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/test', '--class_data_dir=classes/test', '--output_dir=output/test_te_3.0e-7', '--instance_prompt=photo of sks woman', '--class_prompt=photo of woman', '--save_sample_prompt=masterpiece photo of sks woman, 4k, dlsr, canon, nikon', '--save_sample_negative_prompt=lowres, low quality, blurry, jpeg artifacts, text, error', '--seed=5555555', '--save_interval=500', '--n_save_sample=9', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--train_text_encoder', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=3.0e-7', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=1', '--max_train_steps=4000']' returned non-zero exit status 245.
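
A note on the log: the crash comes immediately after "Before initializing optimizer states", i.e. while ZeRO stage 2 with CPU offload is allocating fp32 optimizer state on the host, and the same log implies only about 15.6 GB of RAM is visible inside WSL2 (9.21 GB used = 59%). A rough, assumed estimate of what that allocation needs for the 982,581,444 parameters reported above (arithmetic only, not a measurement):

# Back-of-envelope only: Adam under ZeRO-2 CPU offload keeps fp32 master weights,
# momentum and variance (plus offloaded fp32 gradients) on the host,
# i.e. roughly 12-16 bytes per trainable parameter.
params = 982_581_444              # from "Rank: 0 partition count [1] and sizes[(982581444, False)]"
print(params * 12 / 2**30)        # ~11.0 GiB for master weights + Adam states
print(params * 16 / 2**30)        # ~14.6 GiB if fp32 gradients are offloaded as well

That would not fit in ~15.6 GB alongside what is already resident, which could plausibly explain a segfault/kill at exactly this point rather than a CUDA OOM.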

System Info

Ryzen 9 5900X, 32 GB RAM, RTX 3080 10 GB, Windows 10 (WSL2, Ubuntu)
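
If host memory does turn out to be the limit, note that WSL2 caps the RAM visible to Linux (on recent Windows builds the default is about half of the total, which matches the ~15.6 GB implied by the log above). The cap can be raised with a .wslconfig file in the Windows user profile; the values below are assumptions to adapt, not a recommendation for this exact machine:

# %UserProfile%\.wslconfig — run `wsl --shutdown` from Windows afterwards so it takes effect
[wsl2]
memory=28GB
swap=32GB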

toomanydev avatar Nov 17 '22 15:11 toomanydev

Facing the same issue with my RTX 3080

basicsaki avatar Nov 20 '22 10:11 basicsaki

Where did you change it to use "deepspeed.ops.adam.DeepSpeedCPUAdam"?

runner22k avatar Jan 31 '23 19:01 runner22k
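
Regarding the question above about "deepspeed.ops.adam.DeepSpeedCPUAdam": DeepSpeed recommends its CPU-optimized Adam when the optimizer is offloaded to the CPU, and one way to try it is to swap it in for torch.optim.AdamW where train_dreambooth.py builds the optimizer. The sketch below is an assumption about how that change might look; the variable names (params_to_optimize, the args.adam_* fields) follow the usual structure of the script and may differ in your copy.

# Hypothetical change in train_dreambooth.py: use DeepSpeed's CPU Adam instead of torch.optim.AdamW
from deepspeed.ops.adam import DeepSpeedCPUAdam

optimizer = DeepSpeedCPUAdam(
    params_to_optimize,                       # the same parameter groups previously passed to AdamW
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)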