generative-models

The vae encoder of the first_stage_model

Open forgetable233 opened this issue 1 year ago • 4 comments

I'm using the sv3d_p model. I noticed that the VAE encoder of the first_stage_model is not included in the ckpt. Which VAE encoder was used for the first_stage_model during training?

forgetable233 avatar Apr 10 '24 12:04 forgetable233

Same question.

JiuTongBro avatar Apr 27 '24 07:04 JiuTongBro

Hi guys, I'm also looking into this. It seems that SV3D uses the same encoder and decoder as SVD, and SVD's encoder is released on Hugging Face. You can refer to: https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/tree/main/vae for the ckpt, https://github.com/huggingface/diffusers/blob/v0.24.0-release/src/diffusers/models/autoencoder_kl_temporal_decoder.py for the model code, and https://github.com/huggingface/diffusers/blob/v0.24.0-release/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py for how to use it.
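A minimal sketch of that suggestion, assuming a diffusers version that exports AutoencoderKLTemporalDecoder (0.24 or later) and that you have access to the SVD repository on Hugging Face. The helper names (latent_shape, load_svd_vae, encode_image) are made up for illustration, and latent_channels=4 with an 8x spatial downsampling is an assumption based on the diffusers defaults for this class, not something confirmed for SV3D training.

```python
from typing import Tuple

# Repository taken from the link above; the VAE weights live in its "vae" subfolder.
SVD_REPO = "stabilityai/stable-video-diffusion-img2vid-xt"


def latent_shape(height: int, width: int,
                 num_blocks: int = 4, latent_channels: int = 4) -> Tuple[int, int, int]:
    """Expected (C, H, W) of the latent for one image.

    Assumption: every encoder block except the last halves the resolution,
    so four blocks give a downsampling factor of 2 ** 3 = 8.
    """
    factor = 2 ** (num_blocks - 1)
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (latent_channels, height // factor, width // factor)


def load_svd_vae(repo_id: str = SVD_REPO):
    """Load the SVD VAE (encoder and temporal decoder) from Hugging Face."""
    # Imported lazily so the shape helper above works without diffusers installed.
    from diffusers import AutoencoderKLTemporalDecoder

    vae = AutoencoderKLTemporalDecoder.from_pretrained(repo_id, subfolder="vae")
    return vae.eval()


def encode_image(vae, image):
    """Encode a (B, 3, H, W) tensor scaled to [-1, 1] into latents.

    Uses the mode of the posterior, as the diffusers SVD pipeline does for
    its conditioning image. Whether SV3D training did the same is an assumption.
    """
    import torch

    with torch.no_grad():
        return vae.encode(image).latent_dist.mode()
```

For example, a 576x576 input would be expected to yield a latent of shape latent_shape(576, 576), which is a cheap way to sanity-check the output of encode_image.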

pengc02 avatar May 12 '24 15:05 pengc02

@pengc02 thx! It helps.

chenshuo20 avatar Jul 28 '24 21:07 chenshuo20

I also found that you can load the VAE model with a config like this:

vae_encoder_config:
  target: src.diffusers.models.autoencoders.autoencoder_kl_temporal_decoder.AutoencoderKLTemporalDecoder
  params:
    block_out_channels: [128, 256, 512, 512]
    layers_per_block: 2
    in_channels: 3
    out_channels: 3
    down_block_types: ["DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"]
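For readers unfamiliar with target/params configs, here is a rough sketch of how such an entry is typically resolved: import the module named in target, look up the class, and call it with params as keyword arguments. The two helpers below are re-implemented for illustration (they are similar in spirit to instantiate_from_config in sgm/util.py, but this is not that code). The src. prefix in the target suggests a local checkout of diffusers; with a pip-installed diffusers the path would presumably start at diffusers. instead, which is an assumption. Note that building the class this way only creates the architecture with freshly initialized weights, so the checkpoint still has to be loaded separately.

```python
import importlib
from typing import Any, Dict


def get_obj_from_str(path: str) -> Any:
    """Resolve a dotted path such as 'package.module.ClassName' to the object."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def instantiate_from_config(config: Dict[str, Any]) -> Any:
    """Build config['target'] with config['params'] as keyword arguments."""
    if "target" not in config:
        raise KeyError("config needs a 'target' entry")
    cls = get_obj_from_str(config["target"])
    return cls(**config.get("params", {}))


# The config from this comment as a plain dict (same values, target kept verbatim).
vae_encoder_config = {
    "target": "src.diffusers.models.autoencoders."
              "autoencoder_kl_temporal_decoder.AutoencoderKLTemporalDecoder",
    "params": {
        "block_out_channels": [128, 256, 512, 512],
        "layers_per_block": 2,
        "in_channels": 3,
        "out_channels": 3,
        "down_block_types": ["DownEncoderBlock2D"] * 4,
    },
}

# Demonstration with a standard-library target, so this runs without diffusers.
demo = instantiate_from_config(
    {"target": "fractions.Fraction", "params": {"numerator": 3, "denominator": 4}}
)

# With diffusers importable under the path in 'target', the VAE would be built as:
#   vae = instantiate_from_config(vae_encoder_config)
# after which the released weights still need to be loaded into it.
```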

chenshuo20 avatar Jul 28 '24 23:07 chenshuo20