Expected to have finished reduction in the prior iteration before starting a new one.
I was fine-tuning the Flux.1-dev: Upscaler ControlNet model on my custom data, but ran into this problem during training:
[rank1]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`
Here is my training script:
#! /bin/bash
export NUM_NODES=1
export NUM_GPUS=2
export NCCL_DEBUG=IVFO
export CUDA_LAUNCH_BLOCKING=1
DATASET_BASE_PATH="/workspace/codes/DiffSynth-Studio/data/neemo_mini_1440p_120f/controlnet_data"
DATASET_METADATA_PATH="${DATASET_BASE_PATH}/metadata.csv"
OUTPUT_PATH="/workspace/codes/DiffSynth-Studio/outputs/neemo_mini_1440p_120f/train_finetune_controlnet/models/FLUX.1-dev-Controlnet-Upscaler_lora"
IMG_HEIGHT=1440
IMG_WIDTH=2560
DATASET_REPEAT=10
NUM_EPOCHS=5
accelerate launch --mixed_precision=bf16 --multi_gpu --main_process_port 29501 --num_machines $NUM_NODES --num_processes $NUM_GPUS --config_file examples/flux/model_training/full/accelerate_config.yaml examples/flux/model_training/train.py \
--dataset_base_path $DATASET_BASE_PATH \
--dataset_metadata_path $DATASET_METADATA_PATH \
--data_file_keys "image,controlnet_image" \
--dataset_repeat $DATASET_REPEAT \
--height $IMG_HEIGHT \
--width $IMG_WIDTH \
--model_id_with_origin_paths "black-forest-labs/FLUX.1-dev:flux1-dev.safetensors,black-forest-labs/FLUX.1-dev:text_encoder/model.safetensors,black-forest-labs/FLUX.1-dev:text_encoder_2/,black-forest-labs/FLUX.1-dev:ae.safetensors,jasperai/Flux.1-dev-Controlnet-Upscaler:diffusion_pytorch_model.safetensors" \
--learning_rate 1e-5 \
--num_epochs $NUM_EPOCHS \
--remove_prefix_in_ckpt "pipe.controlnet.models.0." \
--output_path $OUTPUT_PATH \
--trainable_models "controlnet" \
--extra_inputs "controlnet_image" \
--use_gradient_checkpointing
Below are the complete logs:
job-f2c0d17be585-20251017213424-worker-0.log
I would appreciate any solutions or suggestions. 🙏
@cxzhou35 Please check whether the dataset contains controlnet_image. Additionally, change export NCCL_DEBUG=IVFO to export NCCL_DEBUG=INFO to view detailed logs.
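One quick way to run that check is a small standalone script like the one below. It is only a minimal sketch, not part of DiffSynth-Studio: `check_metadata` is a hypothetical helper, and resolving each path relative to the dataset base path is an assumption about how the trainer joins paths (absolute paths in the CSV are left unchanged by the join).

```python
import csv
from pathlib import Path


def check_metadata(metadata_path, base_path, keys=("image", "controlnet_image")):
    """Return a list of human-readable problems found in the metadata CSV.

    Hypothetical helper for illustration only. `keys` should mirror the
    value passed to --data_file_keys.
    """
    problems = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        # Every requested key must exist as a column in the header row.
        missing = [k for k in keys if k not in columns]
        if missing:
            return [f"missing column: {k}" for k in missing]
        # Data rows are numbered from 1 (the header is not counted).
        for row_no, row in enumerate(reader, start=1):
            for k in keys:
                value = (row[k] or "").strip()
                if not value:
                    problems.append(f"row {row_no}: empty {k}")
                # Assumption: relative paths are resolved against base_path.
                elif not (Path(base_path) / value).is_file():
                    problems.append(f"row {row_no}: file not found for {k}: {value}")
    return problems
```

Calling `check_metadata(DATASET_METADATA_PATH, DATASET_BASE_PATH)` with the two paths from the training script returns an empty list when every row has both files; any entries it does return point at the rows to inspect first.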
Thanks for your suggestion. I am sure the dataset contains controlnet_image:
image,prompt,controlnet_image
data/neemo_mini_1440p_120f/gt_images_1440p/32/000000.jpg,"A focused singer with sleek black hair and a white top performs under a bright spotlight amid professional equipment.",data/neemo_mini_1440p_120f/lq_images_720p/32/000000.jpg
I have fixed the typo, and here are the detailed logs: