
FLUX.1-dev DreamBooth checkpoint-saving problem when training on multiple GPUs

Open · jyy-1998 opened this issue 1 year ago · 1 comment

Describe the bug

I tried to train FLUX using accelerate and DeepSpeed, but when using two L40s the model could not be saved properly. What is the problem?

Reproduction

train.sh:

```shell
accelerate launch --config_file config.yaml train_flux.py \
  --pretrained_model_name_or_path="./FLUX.1-dev" \
  --resolution=1024 \
  --train_batch_size=1 \
  --output_dir="output1" \
  --num_train_epochs=10 \
  --checkpointing_steps=5 \
  --validation_steps=500 \
  --max_train_steps=40001 \
  --learning_rate=4e-05 \
  --seed=12345 \
  --mixed_precision="fp16" \
  --revision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=2500
```

config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: 0,1
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Logs

Using /home/oppoer/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00030350685119628906 seconds
10/21/2024 02:58:18 - INFO - __main__ - ***** Running training *****
10/21/2024 02:58:18 - INFO - __main__ -   Num examples = 2109730
10/21/2024 02:58:18 - INFO - __main__ -   Num batches each epoch = 1054865
10/21/2024 02:58:18 - INFO - __main__ -   Num Epochs = 1
10/21/2024 02:58:18 - INFO - __main__ -   Instantaneous batch size per device = 1
10/21/2024 02:58:18 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/21/2024 02:58:18 - INFO - __main__ -   Gradient Accumulation steps = 1
10/21/2024 02:58:18 - INFO - __main__ -   Total optimization steps = 40001
Steps:   0%|                                                                                                                                                                    | 0/40001 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Using /home/oppoer/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007116794586181641 seconds
[2024-10-21 02:58:29,496] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Steps:   0%|                                                                                                                                      | 1/40001 [00:11<127:38:44, 11.49s/it, loss=0.544, lr=0]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-21 02:58:36,774] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Steps:   0%|                                                                                                                                       | 2/40001 [00:18<100:07:40,  9.01s/it, loss=0.36, lr=0]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-21 02:58:44,052] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Steps:   0%|                                                                                                                                       | 3/40001 [00:26<91:19:39,  8.22s/it, loss=0.543, lr=0]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-21 02:58:51,324] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Steps:   0%|                                                                                                                                        | 4/40001 [00:33<87:10:01,  7.85s/it, loss=1.14, lr=0]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
[2024-10-21 02:58:58,612] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
Steps:   0%|                                                                                                                                        | 5/40001 [00:40<84:55:54,  7.64s/it, loss=1.14, lr=0]10/21/2024 02:58:58 - INFO - accelerate.accelerator - Saving current state to output0/checkpoint-5
10/21/2024 02:58:58 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1805073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806456 milliseconds before timing out.
[2024-10-21 03:29:05,325] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/home/notebook/data/personal/sd_inpainting_flux/train_flux.py", line 1519, in <module>
    main()
  File "/home/notebook/data/personal/sd_inpainting_flux/train_flux.py", line 1347, in main
    accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2126, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1862, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1901, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 810, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1258, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 836, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1208, in reduce_ipg_grads
    self.average_tensor(self.ipg_buffer[self.ipg_index])
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 986, in average_tensor
    async_handle = dist.reduce(grad_slice, dst=dst_rank, group=real_dp_process_group[i], async_op=True)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 426, in reduce
    return cdb.reduce(tensor=tensor, dst=dst, op=op, group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 108, in reduce
    return torch.distributed.reduce(tensor=tensor, dst=dst, op=self._reduce_op(op), group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1872, in reduce
    work = group.reduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1805073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9250 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 9251) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
train_flux.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-21_03:29:14
  host      : task-20241021092224-98003
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 9251)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 9251
=====================================================

System Info

torch==2.0.1, CUDA 11.8

Who can help?

No response

jyy-1998 avatar Oct 21 '24 03:10 jyy-1998

For DeepSpeed, model saving needs to be handled a little differently than in the single-GPU or DDP case. You can take a look at this reference, which we tested recently for saving/loading checkpoints when training with DeepSpeed.
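The hang in the logs (rank 0 blocked in a collective at the checkpoint step until the NCCL watchdog fires, while rank 1 times out on a different op) is the typical shape of a rank mismatch during saving. A common cause, assuming the training script guards the save with `if accelerator.is_main_process:` (the script itself is not shown here), is that under DeepSpeed the checkpoint save performs collective operations, so every rank must call it. The sketch below is plain Python threads, not the actual diffusers/DeepSpeed code; a `threading.Barrier` stands in for the collective to show the deadlock mechanism:

```python
import threading

# Illustrative sketch only (not the diffusers/DeepSpeed code): under DeepSpeed,
# a checkpoint save runs collective ops that every rank must enter together.
# If the save is guarded by an is_main_process check, rank 0 enters the
# collective while rank 1 never does, and rank 0 blocks until the NCCL
# watchdog aborts the job. A threading.Barrier reproduces that shape.

NUM_RANKS = 2
collective = threading.Barrier(NUM_RANKS, timeout=0.5)  # stand-in for an NCCL collective
results = {}

def save_checkpoint(rank, guard_with_main_process_check):
    is_main_process = rank == 0
    if guard_with_main_process_check and not is_main_process:
        results[rank] = "skipped save"   # rank 1 never reaches the collective
        return
    try:
        collective.wait()                # all ranks must arrive here together
        results[rank] = "saved"
    except threading.BrokenBarrierError:
        results[rank] = "timeout"        # analogue of the watchdog timeout

def run(guarded):
    results.clear()
    collective.reset()                   # clear any broken state from a prior run
    threads = [threading.Thread(target=save_checkpoint, args=(r, guarded))
               for r in range(NUM_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)

print(run(guarded=True))    # rank 0 times out waiting for rank 1
print(run(guarded=False))   # every rank enters the collective; the save completes
```

In the real script the fix corresponding to `guarded=False` is to let every process call the save (e.g. `accelerator.save_state`), gating only auxiliary work such as writing config files or logging on the main process.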

a-r-r-o-w avatar Oct 21 '24 21:10 a-r-r-o-w