
Unable to save checkpoint during multi-GPU training

Open · fzuo1230 opened this issue 10 months ago · 3 comments

Hello, I am training Wan2.1-I2V-14B on 8 A800 GPUs, but when the checkpoint is saved I get the following error:

Training failed: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mmu_mllm_hdd_2/zuofei/VideoTuna/scripts/train_new.py", line 131, in <module>
[rank1]:     trainer.fit(flow, data, ckpt_path=train_config.resume_ckpt)
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank1]:     results = self._run_stage()
[rank1]:               ^^^^^^^^^^^^^^^^^
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
[rank1]:     self.fit_loop.run()
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank1]:     self.advance()
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank1]:     self.epoch_loop.run(self._data_fetcher)
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank1]:     self.advance(data_fetcher)
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 269, in advance
[rank1]:     call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
[rank1]:   File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 218, in _call_callback_hooks
[rank1]:     fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank1]:   File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 99, in on_train_batch_end
[rank1]:     self._save_last_checkpoint(trainer, monitor_candidates, pl_module)  # only save the last checkpoint
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 133, in _save_last_checkpoint
[rank1]:     self._save_checkpoint(trainer, filepath, pl_module)
[rank1]:   File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 146, in _save_checkpoint
[rank1]:     self._save_flow_checkpoint(trainer, pl_module, filepath)
[rank1]:   File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 171, in _save_flow_checkpoint
[rank1]:     os.makedirs(new_dirpath)
[rank1]:   File "<frozen os>", line 225, in makedirs
[rank1]: FileExistsError: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'

It seems this happens because the model is parallelized during training, so each GPU process tries to save its own checkpoint and they collide on the same directory. How can I solve this problem?

fzuo1230 · Jun 12, 2025
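The traceback above ends in the unguarded os.makedirs(new_dirpath) call in videotuna/utils/callbacks.py: under multi-GPU training every rank executes the checkpoint callback, and any rank that tries to create the directory after the first one raises FileExistsError. A minimal sketch of the usual fix, assuming the save logic can be gated like this (save_flow_checkpoint is a hypothetical stand-in for the actual callback method, not VideoTuna's real code):

    import os

    def save_flow_checkpoint(trainer, new_dirpath: str) -> None:
        # exist_ok=True makes directory creation idempotent, so ranks racing
        # to create the same path no longer raise FileExistsError.
        os.makedirs(new_dirpath, exist_ok=True)

        # With non-sharded strategies (plain DDP), every rank holds identical
        # weights, so only the main process needs to write the checkpoint.
        if trainer.is_global_zero:
            ...  # serialize the selected weights into new_dirpath

One caveat: the config posted later in this thread uses deepspeed_stage_3_offload, where parameters and optimizer state are sharded across ranks. In that case a rank-0-only save is not sufficient on its own; Lightning's DeepSpeed strategy writes a checkpoint directory containing per-rank shards, so the directory creation itself must simply tolerate concurrent ranks.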

Hi @fzuo1230, can you share the steps you used to run on multiple GPUs?

Akshaysharma29 · Jul 2, 2025

> Hi @fzuo1230, can you share the steps you used to run on multiple GPUs?

I have solved this problem, thanks for your reply. But I found that as training progresses, the loss from full fine-tuning looks abnormal: after fine-tuning for two epochs on nearly 15,000 samples, the loss is very volatile, sometimes even exceeding 1, and running inference with the saved checkpoint produces pure noise. Did I set the training parameters incorrectly? My config is below:

flow:
  target: videotuna.flow.wanvideo.WanVideoModelFlow
  params:
    task: "i2v-14B"                  
    ckpt_path: "/mmu_mllm_hdd/zuofei/model_param/Wan2.1-I2V-14B-720P"                    
    offload_model: true               
    ulysses_size: 1                  
    ring_size: 1                     
    t5_fsdp: false                    
    t5_cpu: false                    
    dit_fsdp: false                   
    use_prompt_extend: false          
    prompt_extend_method: "local_qwen" 
    prompt_extend_model: null         
    prompt_extend_target_lang: "zh"   
    seed: 42                     

    denoiser_config:
      target: videotuna.models.wan.wan.modules.model.WanModel
      use_from_pretrained: true
      params:
        pretrained_model_name_or_path: ${flow.params.ckpt_path}

    first_stage_config:
      target: videotuna.models.wan.wan.modules.vae.WanVAE_
      params:
        dim: 96
        z_dim: 16
        dim_mult: [1, 2, 4, 4]
        num_res_blocks: 2
        attn_scales: []
        temperal_downsample: [false, true, true]
        dropout: 0.0

    cond_stage_config:
      target: videotuna.models.wan.wan.modules.t5.T5Encoder
      params:
        dim: 4096
        dim_attn: 4096
        dim_ffn: 10240
        num_heads: 64
        num_buckets: 32
        shared_pos: false
        dropout: 0.1
        vocab: 256384
        num_layers: 24

      
    cond_stage_2_config:
      target: videotuna.models.wan.wan.modules.clip.XLMRobertaCLIP
      params:
        embed_dim: 1024
        image_size: 224
        patch_size: 14
        vision_dim: 1280
        vision_mlp_ratio: 4
        vision_heads: 16
        vision_layers: 32
        vision_pool: "token"
        activation: "gelu"
        vocab_size: 250002
        max_text_len: 514
        type_size: 1
        pad_id: 1
        text_dim: 1024
        text_heads: 16
        text_layers: 24
        text_post_norm: true
        text_dropout: 0.1
        attn_dropout: 0.0
        proj_dropout: 0.0
        embedding_dropout: 0.0

train:
  ckpt: /mmu_mllm_hdd/zuofei/model_param/Wan2.1-I2V-14B-720P
  name: train_wan_i2v_fullft
  logdir: results/train
  seed: 42
  debug: false         
  first_stage_key: video
  cond_stage_key: caption
  mapping:
    train.ckpt : flow.params.ckpt_path

  lr_config:
    base_learning_rate: 6.0e-6
    scale_lr: False

  data:
    target: videotuna.data.lightningdata.DataModuleFromConfig
    params:
      batch_size: 1
      num_workers: 2
      wrap: false
      train:
        target: videotuna.data.datasets.DatasetFromCSV
        params:
          csv_path: dataset/exp_caption.csv
          height: 960
          width: 720
          num_frames: 81
          frame_interval: 1
          train: True

  lightning:
    strategy: deepspeed_stage_3_offload
    trainer:
      accelerator: gpu
      benchmark: True
      num_nodes: 4
      accumulate_grad_batches: 1
      max_epochs: 3
      precision: bf16-mixed
    callbacks:
      image_logger:
        target: videotuna.utils.callbacks.ImageLogger
        params:
          batch_frequency: 50
          max_images: 6
          to_local: True # save videos into files
          log_images_kwargs:
            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
      model_checkpoint:
        target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
        params:
          filename: "{epoch:03}-{step:09}"
          save_only_selected_model: True
          selected_model: ["denoiser"]
          save_weights_only: False
          save_on_train_epoch_end: False
          save_last: True
          every_n_epochs: 0
          every_n_train_steps: 100

inference:
  mode: i2v
  ckpt_path: /mmu_mllm_hdd/zuofei/model_param/Wan2.1-I2V-14B-720P
  savedir: results/i2v/wanvideo
  seed: 42
  height: 1280
  width: 720
  prompt_dir: "inputs/i2v/576x1024"
  solver: "unipc"           
  num_inference_steps: 50                
  time_shift: 5.0                
  unconditional_guidance_scale: 5.0                       
  frames: 81
  n_samples_prompt: 1
  bs: 1
  savefps: 16
  enable_model_cpu_offload: true

  mapping:
    inference.ckpt_path : flow.params.ckpt_path
    inference.seed : flow.params.seed
    inference.enable_model_cpu_offload : flow.params.offload_model

fzuo1230 · Jul 9, 2025
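For readers puzzled by the mapping blocks in the config above: they declare that one config key's value should be copied onto another before the flow is instantiated (for example, inference.ckpt_path into flow.params.ckpt_path). A rough sketch of how such a mapping can be resolved with OmegaConf; apply_mapping is a hypothetical helper, not necessarily how VideoTuna's loader actually implements it:

    from omegaconf import OmegaConf

    def apply_mapping(cfg, mapping):
        # Copy each source key's value onto its destination key, e.g.
        # "train.ckpt" -> "flow.params.ckpt_path".
        for src, dst in mapping.items():
            OmegaConf.update(cfg, dst, OmegaConf.select(cfg, src))
        return cfg

    cfg = OmegaConf.load("wan_i2v_fullft.yaml")  # hypothetical filename
    cfg = apply_mapping(cfg, {"inference.ckpt_path": "flow.params.ckpt_path",
                              "inference.seed": "flow.params.seed"})

This keeps a single source of truth for paths and seeds in the train/inference sections while the flow constructor still reads everything from flow.params.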

@fzuo1230, I'm having the same error. How did you solve this?

adwaykanhere · Sep 13, 2025