DiffSynth-Studio

Use DiffSynth-Studio to train an i2v model based on the Wan 1.3B t2v model

Open · lith0613 opened this issue 10 months ago · 3 comments

Thank you for providing a very sleek and user-friendly diffusion framework. I'm currently trying to fine-tune the 14B i2v model, but I don't have enough VRAM. Is it possible to import the 1.3B t2v weights into this framework and then train an i2v model? I've noticed that `ModelManager(torch_dtype=torch.bfloat16, device="cpu")` is used to load pretrained weights into predefined model structures, which doesn't seem very convenient when you want to define a model structure yourself and then import only partial parameters, such as defining an i2v structure and importing t2v weights. Here's the code I looked at:

```python
import torch
from diffsynth import ModelManager, WanVideoPipeline

model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cpu")
model_manager.load_models([
    "Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
])
self.pipe = WanVideoPipeline.from_model_manager(model_manager)
```

This framework only supports the import of pretrained model weights with a defined structure, which is not very convenient for predefining a model structure and then importing partial parameters, such as defining an i2v structure and importing t2v weights.
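
For reference, the usual PyTorch workaround for this kind of partial initialization is to instantiate the target (i2v-shaped) module yourself and copy over only the tensors that also exist in the t2v checkpoint, via `load_state_dict(..., strict=False)`. A minimal sketch, assuming you have already built the i2v DiT by adapting the backend code; the helper name below is hypothetical and not part of DiffSynth-Studio:

```python
from safetensors.torch import load_file
import torch.nn as nn

def load_partial_weights(model: nn.Module, checkpoint_path: str) -> None:
    """Copy every tensor whose name and shape match the checkpoint;
    leave the remaining (e.g. newly added i2v) parameters at their init."""
    source = load_file(checkpoint_path)   # t2v state dict
    target = model.state_dict()           # i2v-shaped state dict
    compatible = {k: v for k, v in source.items()
                  if k in target and target[k].shape == v.shape}
    model.load_state_dict(compatible, strict=False)
    print(f"initialized {len(compatible)}/{len(target)} tensors from t2v")

# e.g. load_partial_weights(my_i2v_dit, "Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors")
```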

lith0613 commented Mar 10 '25 09:03

@lith0613 If you wish to train the 1.3B t2v model into an i2v model, we recommend that you take a close look at our backend code and make modifications as needed.

We did not expose the model structure in the sample code because, in most cases, we cannot guide users on which model structure they should load themselves.

Additionally, training the i2v model is highly resource-intensive, and we suggest that you attempt it only if you have access to hundreds of GPUs.

Artiprocher commented Mar 10 '25 10:03

Thank you very much for your response. I have a question about your I2V model compared with the T2V model: it seems that many new parameters have been added, especially in the image cross-attention layers. These newly initialized parameters would likely impair the generative capabilities of the pretrained T2V model. Could you explain the rationale behind this design choice? Additionally, if I wanted to train this I2V model from scratch, how could I preserve the generative abilities of the original T2V model?

lith0613 commented Mar 11 '25 07:03

@lith0613 This is a difficult question to answer. The model structure was designed and trained by the Wan team by adding channels to the T2V model, and this design is consistent with Stable Video Diffusion. We speculate that, in the early stages of I2V training, the new parameters were initialized in the style of adapters, similar to the zero convolutions in ControlNet, in order to avoid impacting the original capabilities of the model.
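
To make the zero-initialization idea concrete, here is a generic PyTorch sketch (not the actual Wan or DiffSynth-Studio implementation) of an adapter-style image cross-attention block whose output projection starts at zero, so the network initially behaves exactly like the pretrained T2V model:

```python
import torch
import torch.nn as nn

class ZeroInitImageCrossAttention(nn.Module):
    """Adapter-style image cross-attention: the output projection is
    zero-initialized, so at step 0 the residual branch adds nothing."""

    def __init__(self, dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=image_dim,
                                          vdim=image_dim, batch_first=True)
        self.to_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_out.weight)   # "zero convolution" trick
        nn.init.zeros_(self.to_out.bias)

    def forward(self, x: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, image_tokens, image_tokens)
        return x + self.to_out(attn_out)     # residual: exactly x at init

# At initialization the block is an identity mapping, so the generative
# quality of the T2V backbone is untouched until training updates to_out.
x = torch.randn(2, 16, 1024)
img = torch.randn(2, 4, 1280)
block = ZeroInitImageCrossAttention(dim=1024, image_dim=1280)
assert torch.allclose(block(x, img), x)
```

Because the residual branch contributes nothing at initialization, gradients still flow into `to_out` during training, so the image conditioning can be learned without first degrading the pretrained T2V backbone.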

Artiprocher commented Mar 14 '25 02:03