Expected to have finished reduction in the prior iteration before starting a new one.

Open cxzhou35 opened this issue 3 months ago • 2 comments

I was fine-tuning the Flux.1-dev Upscaler ControlNet model on my custom data and ran into this error during training:

[rank1]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`

Here is my training script:

#! /bin/bash
export NUM_NODES=1
export NUM_GPUS=2

export NCCL_DEBUG=IVFO
export CUDA_LAUNCH_BLOCKING=1

DATASET_BASE_PATH="/workspace/codes/DiffSynth-Studio/data/neemo_mini_1440p_120f/controlnet_data"
DATASET_METADATA_PATH="${DATASET_BASE_PATH}/metadata.csv"
OUTPUT_PATH="/workspace/codes/DiffSynth-Studio/outputs/neemo_mini_1440p_120f/train_finetune_controlnet/models/FLUX.1-dev-Controlnet-Upscaler_lora"
IMG_HEIGHT=1440
IMG_WIDTH=2560
DATASET_REPEAT=10
NUM_EPOCHS=5

accelerate launch --mixed_precision=bf16 --multi_gpu --main_process_port 29501 --num_machines $NUM_NODES --num_processes $NUM_GPUS --config_file examples/flux/model_training/full/accelerate_config.yaml examples/flux/model_training/train.py \
  --dataset_base_path $DATASET_BASE_PATH \
  --dataset_metadata_path $DATASET_METADATA_PATH \
  --data_file_keys "image,controlnet_image" \
  --dataset_repeat $DATASET_REPEAT \
  --height $IMG_HEIGHT \
  --width $IMG_WIDTH \
  --model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,jasperai/Flux.1-dev-Controlnet-Upscaler:diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-5 \
  --num_epochs $NUM_EPOCHS \
  --remove_prefix_in_ckpt "pipe.controlnet.models.0." \
  --output_path $OUTPUT_PATH \
  --trainable_models "controlnet" \
  --extra_inputs "controlnet_image" \
  --use_gradient_checkpointing

Below are the complete logs:

job-f2c0d17be585-20251017213424-worker-0.log

I would appreciate any solutions or suggestions. 🙏

cxzhou35 • Oct 17 '25 14:10

@cxzhou35 Please check whether the dataset contains controlnet_image. Additionally, change export NCCL_DEBUG=IVFO to export NCCL_DEBUG=INFO to view detailed logs.
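
One way to run that dataset check is sketched below. It is only a rough helper, not part of DiffSynth-Studio: the function name `check_metadata` is hypothetical, and it assumes that the paths in metadata.csv are resolved relative to the directory passed as --dataset_base_path. It reports a missing column, an empty cell, or a file that does not exist for each key listed in --data_file_keys.

```python
import csv
from pathlib import Path


def check_metadata(metadata_path, base_path, keys=("image", "controlnet_image")):
    """Return a list of problems found in the metadata CSV (empty list means OK).

    Hypothetical helper: assumes file paths in the CSV are relative to base_path.
    """
    problems = []
    base = Path(base_path)
    with open(metadata_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        # Stop early if a required column is absent from the header row.
        missing_cols = [k for k in keys if k not in header]
        if missing_cols:
            return [f"missing column(s): {', '.join(missing_cols)}"]
        # Data rows start at line 2 because line 1 is the header.
        for row_no, row in enumerate(reader, start=2):
            for key in keys:
                value = (row[key] or "").strip()
                if not value:
                    problems.append(f"row {row_no}: empty {key}")
                elif not (base / value).is_file():
                    problems.append(f"row {row_no}: {key} file not found: {value}")
    return problems
```

Calling `check_metadata(DATASET_METADATA_PATH, DATASET_BASE_PATH)` with the two paths from the training script should return an empty list if every row has both files. Any entry in the returned list points at a row worth inspecting before looking further into the DDP error.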

Artiprocher • Oct 20 '25 02:10

> @cxzhou35 Please check whether the dataset contains controlnet_image. Additionally, change export NCCL_DEBUG=IVFO to export NCCL_DEBUG=INFO to view detailed logs.

Thanks for your suggestion. I have confirmed that the dataset contains a controlnet_image column:

image,prompt,controlnet_image
data/neemo_mini_1440p_120f/gt_images_1440p/32/000000.jpg,"A focused singer with sleek black hair and a white top performs under a bright spotlight amid professional equipment.",data/neemo_mini_1440p_120f/lq_images_720p/32/000000.jpg

I have fixed the typo, and here are the detailed logs:

job-1a954d25602b-20251020131936-worker-0.zip

cxzhou35 • Oct 20 '25 05:10