Zhifeng comments

Results: 9 comments of Zhifeng

OSError: Unable to synchronously open file (file signature not found)

I have the same problem. How did you solve it in the end?
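For readers hitting the same error: this message is typically raised by the HDF5 library (for example through h5py) when the file does not begin with the HDF5 format signature, which often points to a truncated download or a file that is not HDF5 at all. The sketch below is one way to check this with the standard library only, before opening the file. The function name `has_hdf5_signature` and the `max_offset` cap are illustrative assumptions, not part of h5py or of the project discussed in the thread.

```python
from pathlib import Path

# The 8-byte signature that the HDF5 format places at the start of the superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path, max_offset=1 << 20):
    """Return True if the HDF5 signature is found at offset 0, 512, 1024, 2048, ...

    `max_offset` is an arbitrary cap chosen for this sketch so that the
    search stops early on large files that are clearly not HDF5.
    """
    size = Path(path).stat().st_size
    with open(path, "rb") as f:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= size and offset <= max_offset:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # The format allows the superblock at 0, then at 512 and each doubling.
            offset = 512 if offset == 0 else offset * 2
    return False
```

If the check returns False, the file is probably corrupted or in a different format, and downloading or regenerating it is a reasonable first step.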

Key 'severity' is not in struct

same, is it solved???

Enabling `deepspeed_stage_3` together with `use_gradient_checkpointing` and `use_gradient_checkpointing_offload` during training raises an error

I also encountered a similar error. My solution was to switch to deepspeed2. I hope my suggestions can help you.

Enabling `deepspeed_stage_3` together with `use_gradient_checkpointing` and `use_gradient_checkpointing_offload` during training raises an error

> > I also encountered a similar error. My solution was to switch to deepspeed2. I hope my suggestions can help you.
>
> [@zfw-cv](https://github.com/zfw-cv) Thank you, but using `deepspeed_stage_2`...

We cannot detect the model type. No models are loaded.

> `model_manager.load_models([ "models/lightning_logs/version_2/checkpoints/epoch=9-step=5000.ckpt", "models/Wan-AI/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth", "models/Wan-AI/Wan2.1-T2V-14B/Wan2.1_VAE.pth", ])` When training your own CKPT and following the tutorial to run test.py: Loading models from: models/lightning_logs/version_2/checkpoints/epoch=9-step=5000.ckpt We cannot detect the model type. No models...

We cannot detect the model type. No models are loaded.

> I use deepspeed to train i2v-14b model, but only optimizer is saved, I cannot find any model file.
>
> Hello, I also encountered a similar problem. I trained...

We cannot detect the model type. No models are loaded.

> After full training, I used zero_to_fp32.py to convert *.pt to *.safetensors. At inference, when I load the model, it also reports the log: No wan_video_dit models available. We cannot detect...

about batch size in Wan I2V training

> I would like to follow up with another question: does data_processing also only allow batch_size=1? Currently I find that data processing does not seem to support multi-GPU, so the...

how 14B T2V full training? on 80GB H100 gpu

> I use `--use_gradient_checkpointing --use_gradient_checkpointing_offload --training_strategy "deepspeed_stage_2"` and can full fine-tune. Hello, I also implemented lora and full training based on deepspeed_stage_2. But I found that...