Questions about input of VLDM
Excellent job!
I've got some questions about the code.
Is the input of VLDM taken from batch_rgb (aka query_rgb)? But don't we lack query_rgb at the training stage?
And where does z_scale_factor come from? Why is it fixed to 0.18215?
Thanks!
Hi!
batch_rgb is a batch of RGB images. We encode batch_rgb into a batch of latents, images_z. Our model predicts the latents (images_z), not the RGB images directly. We use a frozen VAE.
z_scale_factor is a hyperparameter from Stable Diffusion (they multiply the VAE latents by it before performing diffusion).
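For concreteness, the encoding step looks roughly like the sketch below (a minimal, hypothetical illustration assuming a diffusers-style `AutoencoderKL`; the checkpoint name and the `encode_rgb` helper are placeholders, not the actual code in this repo):

```python
import torch
from diffusers import AutoencoderKL

z_scale_factor = 0.18215  # Stable Diffusion hyperparameter

# Frozen VAE: we never train it, only use it to map RGB <-> latents.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
for p in vae.parameters():
    p.requires_grad_(False)

def encode_rgb(batch_rgb):
    # batch_rgb: (B, 3, H, W), values in [-1, 1]
    with torch.no_grad():
        images_z = vae.encode(batch_rgb).latent_dist.sample()
    # Diffusion operates on the scaled latents, not on RGB pixels.
    return images_z * z_scale_factor
```

At decoding time the latents are divided by z_scale_factor again before being passed to the VAE decoder.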
Thanks for your prompt reply!
Maybe I got it wrong. The objective of this paper is: given 2 views, generate new views directly. So at the training stage, all we have is input_rgb (the context views), and batch_rgb (the query views) is what we want to generate. How can it be used as input to the VAE encoder and the diffusion model?
Or is this framework not instance-specific? Do we train the model on many instances and generalize to new instances at the distillation stage? If so, could you explain a bit more about it?
Thank you so much for your help!
Sorry, I have another question: have you tried single-view reconstruction with this framework?
Thanks for your help!
VLDM is not instance-specific. It is trained on many instances. During training, the model sees input_rgb, input_cameras, and query_cameras. We have ground truth for query_rgb, which we use to supervise the model. Since we use a latent diffusion model, the model predicts latent codes internally.
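Schematically, one training step might look like the sketch below (a hypothetical outline assuming a standard noise-prediction objective and a diffusers-style scheduler; `denoiser`, `scheduler`, `encode_rgb`, and the conditioning interface are placeholders, not this repo's actual API):

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, scheduler, input_rgb, input_cameras, query_cameras, query_rgb):
    # Ground-truth query views are only used to build the diffusion target.
    x0 = encode_rgb(query_rgb)              # clean latents (x_0), via the frozen VAE
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    xt = scheduler.add_noise(x0, noise, t)  # noised latents at timestep t

    # The denoiser is conditioned on the context views and both camera sets.
    pred = denoiser(xt, t, input_rgb, input_cameras, query_cameras)
    return F.mse_loss(pred, noise)          # epsilon-prediction loss
```

At sampling time the model starts from pure noise and only needs input_rgb and the cameras, so query_rgb is never required at inference.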
Single-view reconstruction requires a bit more careful consideration since the object scale is ambiguous.
Thanks for your patience. I may not have asked my question the right way.
The view that is batch_rgb in the code and is fed into the VAE encoder is query_rgb? So at the training stage, you feed the ground-truth query_rgb into the VAE encoder and use the latent as x_0 of the diffusion model?
I've been trying your model on a custom dataset. You said batch_rgb is a batch of images. For one iteration the dataloader gives me a tensor of shape [1, B, H, W]. Can I understand these dimensions as: 1 --> the dataloader batch_size, and B --> the sequence length (the predefined sample_batch_size)?
Similarly, the camera parameters right out of the dataloader should have corresponding shapes, for example:
- T --> [1, B, 3, 1]
- R --> [1, B, 3, 3]
- focal_length --> [1, B, 2, 2]
- principal_point --> [1, B, 2, 2]
I've made my own dataset and dataloader to output data in the above form, but every time the code needs to compute the inverse transform, it ends with an error:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
Have you ever met this error in your experiment?
Usually CUBLAS_STATUS_NOT_INITIALIZED means the program or environment cannot find a CUDA GPU.
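A quick sanity check is to run a tiny matmul on the GPU outside the training code (generic PyTorch, not specific to this repo):

```python
import torch

# CUBLAS_STATUS_NOT_INITIALIZED usually means no visible CUDA device
# (or the GPU had no free memory when cuBLAS allocated its workspace).
print(torch.cuda.is_available())      # expect True
print(torch.cuda.device_count())      # expect >= 1
x = torch.randn(4, 4, device="cuda")
print(x @ x)                          # a small matmul exercises cuBLAS directly
```

If this fails, the problem is the environment (driver / CUDA visibility / GPU memory) rather than your dataloader shapes.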