[Bug] Fine-tuning Wan2.1 produces all-noise validation videos that remain unchanged regardless of the number of fine-tuning steps
Describe the bug
I'm fine-tuning the Wan2.1 model on our data with FastVideo/scripts/finetune/finetune_v1_VSA.sh and FastVideo/scripts/finetune/finetune_v1.sh, but the videos generated during the validation phase remain unchanged regardless of the number of fine-tuning steps. For example:
https://github.com/user-attachments/assets/b006a20b-095b-433d-b885-c802f1df591e
https://github.com/user-attachments/assets/66f4c664-b9fb-43de-bcaf-02813d896e0c
However, when I use the transformer from the saved checkpoint for inference, the results are normal. Could there be an issue with the validation sampling, or somewhere else? Do you have any suggestions for modifications?
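For reference, this is roughly how I run inference from the saved checkpoint (a minimal sketch assuming the checkpoint contains a diffusers-format transformer subfolder; the path and prompt are placeholders):

```python
# Minimal sketch: inference from the fine-tuned transformer checkpoint.
# Assumes a diffusers-format transformer in the checkpoint directory;
# the checkpoint path and prompt below are placeholders.
import torch
from diffusers import WanPipeline, WanTransformer3DModel
from diffusers.utils import export_to_video

transformer = WanTransformer3DModel.from_pretrained(
    "outputs/wan_finetune_vsa/checkpoint-1200/transformer",  # placeholder path
    torch_dtype=torch.bfloat16,
)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

video = pipe(
    prompt="a placeholder prompt",
    height=448,
    width=832,
    num_frames=61,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "checkpoint_sample.mp4", fps=16)
```

Sampling like this from checkpoint-1200 gives normal-looking videos, which is why I suspect the validation path rather than training itself.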
Reproduction
FastVideo/scripts/finetune/finetune_v1_VSA.sh and FastVideo/scripts/finetune/finetune_v1.sh:
export WANDB_PROJECT=fastvideo-vsa
export TRITON_CACHE_DIR=/tmp/triton_cache
export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
DATA_DIR=FastVideo/FVchitect/Vchitect_T2V_DataVerse/split_new/all
VALIDATION_DIR=//FastVideo/FVchitect/Vchitect_T2V_DataVerse/split_new/latents2_resume2_61/validation_parquet_dataset/worker_0
NUM_GPUS=4
CHECKPOINT_PATH="$DATA_DIR/outputs/wan_finetune_vsa/checkpoint-1200"
# If you do not have 32 GPUs, to fit in memory you can: 1) increase sp_size, 2) reduce num_latent_t
torchrun --nnodes 1 --nproc_per_node $NUM_GPUS \
    --master_port=25368 \
    FastVideo/fastvideo/v1/training/wan_training_pipeline.py \
    --model_path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --inference_mode False \
    --pretrained_model_name_or_path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --cache_dir "/home/ray/.cache" \
    --data_path "$DATA_DIR" \
    --validation_preprocessed_path "$VALIDATION_DIR" \
    --train_batch_size 1 \
    --num_latent_t 8 \
    --sp_size 1 \
    --tp_size 1 \
    --num_gpus $NUM_GPUS \
    --hsdp_replicate_dim $NUM_GPUS \
    --hsdp_shard_dim 1 \
    --train_sp_batch_size 1 \
    --dataloader_num_workers 4 \
    --gradient_accumulation_steps 1 \
    --max_train_steps 50000 \
    --learning_rate 1e-5 \
    --mixed_precision "bf16" \
    --checkpointing_steps 100 \
    --validation_steps 500 \
    --validation_sampling_steps "2,4,8,50" \
    --log_validation \
    --checkpoints_total_limit 3 \
    --allow_tf32 \
    --ema_start_step 0 \
    --cfg 0.0 \
    --output_dir "$DATA_DIR/outputs/wan_finetune_vsa" \
    --tracker_project_name VSA_finetune \
    --num_height 448 \
    --num_width 832 \
    --num_frames 61 \
    --flow_shift 3 \
    --validation_guidance_scale "5.0" \
    --num_euler_timesteps 50 \
    --master_weight_type "fp32" \
    --dit_precision "fp32" \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --VSA_sparsity 0.9 \
    --VSA_decay_rate 0.03 \
    --VSA_decay_interval_steps 30
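For context on the VSA flags above, my reading (an assumption based on the flag names, not confirmed against the FastVideo source) is that the attention sparsity ramps up by VSA_decay_rate every VSA_decay_interval_steps until it reaches VSA_sparsity:

```python
# Hypothetical reading of the VSA schedule flags above; the actual
# FastVideo implementation may differ.
def vsa_sparsity_at(step: int,
                    target: float = 0.9,    # --VSA_sparsity
                    rate: float = 0.03,     # --VSA_decay_rate
                    interval: int = 30      # --VSA_decay_interval_steps
                    ) -> float:
    """Sparsity increases by `rate` every `interval` steps, capped at `target`."""
    return min(target, (step // interval) * rate)

# Under this reading, full 0.9 sparsity is reached at step 900,
# so the first validation at step 500 would run at ~0.48 sparsity.
print(vsa_sparsity_at(500))  # 0.48
```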
Environment
Same as the standard FastVideo environment.
Could you set --validation_sampling_steps to 50? This flag specifies the number of inference steps used for validation.
I have already tried --validation_sampling_steps=50, but it doesn't work for VSA fine-tuning. Do you have any suggestions for modifications?
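In case it helps narrow this down, here is a hypothetical diagnostic (again assuming a diffusers-format transformer subfolder; the checkpoint path is a placeholder) to confirm that the saved checkpoint actually diverges from the base weights. If the weights differ but the validation videos never change, the validation pipeline is likely not using the updated weights:

```python
# Hypothetical diagnostic: compare fine-tuned vs. base transformer weights.
# A max difference near zero would mean no update reached the checkpoint;
# a clearly nonzero difference points the problem at the validation path.
import torch
from diffusers import WanTransformer3DModel

base = WanTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer",
    torch_dtype=torch.float32,
)
tuned = WanTransformer3DModel.from_pretrained(
    "outputs/wan_finetune_vsa/checkpoint-1200/transformer",  # placeholder path
    torch_dtype=torch.float32,
)

max_diff = max(
    (p - q).abs().max().item()
    for p, q in zip(base.parameters(), tuned.parameters())
)
print(f"max abs weight difference: {max_diff:.6e}")
```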
How did you preprocess the dataset? Did you use the same model?