Questions about input of VLDM
Excellent job!
I've got some questions about the code.
Is the input of VLDM taken from batch_rgb (aka query_rgb)? But don't we lack query_rgb at the training stage?
And where does z_scale_factor come from? Why is it fixed to 0.18215?
Thanks!
Hi!
batch_rgb is a batch of RGB images. We encode batch_rgb into a batch of latents, images_z. Our model predicts the latents (images_z), not the RGB images directly. We use a frozen VAE.
z_scale_factor is a hyperparameter from Stable Diffusion (they multiply the VAE latents by it before performing diffusion).
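For concreteness, the encoding step looks roughly like the sketch below (a minimal, hypothetical illustration assuming a diffusers-style `AutoencoderKL`; the checkpoint name and the `encode_rgb` helper are placeholders, not the actual code in this repo):

```python
import torch
from diffusers import AutoencoderKL

z_scale_factor = 0.18215  # Stable Diffusion hyperparameter

# Frozen VAE: we never train it, only use it to map RGB <-> latents.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
for p in vae.parameters():
    p.requires_grad_(False)

def encode_rgb(batch_rgb):
    # batch_rgb: (B, 3, H, W), values in [-1, 1]
    with torch.no_grad():
        images_z = vae.encode(batch_rgb).latent_dist.sample()
    # Diffusion operates on the scaled latents, not on RGB pixels.
    return images_z * z_scale_factor
```

At decoding time the latents are divided by z_scale_factor again before being passed to the VAE decoder.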
Thanks for your prompt reply!
Maybe I got it wrong. The objective of this paper is: given 2 views, generate new views directly. So at the training stage, all we have is input_rgb (the context views), and batch_rgb (the query views) is what we want to generate. How can it be used as input to the VAE encoder and the diffusion model?
Or is this framework not instance-specific? Do we train the model on many instances and generalize to new instances at the distillation stage? If so, could you explain a bit more about it?
Thank you so much for your help!
Sorry, I have another question: have you tried single-view reconstruction with this framework?
Thanks for your help!
VLDM is not instance-specific. It is trained on many instances. During training, the model sees input_rgb, input_cameras, and query_cameras. We have ground truth for query_rgb, which we use to supervise the model. Since we use a latent diffusion model, the model predicts latent codes internally.
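Schematically, one training step might look like the sketch below (a hypothetical outline assuming a standard noise-prediction objective and a diffusers-style scheduler; `denoiser`, `scheduler`, `encode_rgb`, and the conditioning interface are placeholders, not this repo's actual API):

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, scheduler, input_rgb, input_cameras, query_cameras, query_rgb):
    # Ground-truth query views are only used to build the diffusion target.
    x0 = encode_rgb(query_rgb)              # clean latents (x_0), via the frozen VAE
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    xt = scheduler.add_noise(x0, noise, t)  # noised latents at timestep t

    # The denoiser is conditioned on the context views and both camera sets.
    pred = denoiser(xt, t, input_rgb, input_cameras, query_cameras)
    return F.mse_loss(pred, noise)          # epsilon-prediction loss
```

At sampling time the model starts from pure noise and only needs input_rgb and the cameras, so query_rgb is never required at inference.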
Single-view reconstruction requires a bit more careful consideration since the object scale is ambiguous.
Thanks for your patience. I may not have asked my question the right way.
The view that is batch_rgb in the code and is fed into the VAE encoder is query_rgb? So at the training stage, you feed the ground-truth query_rgb into the VAE encoder and use the latent as x_0 of the diffusion model?
I've been trying your model on a custom dataset. You said batch_rgb is a batch of images. For one iteration the dataloader gives me a tensor of shape [1, B, H, W]. Can I understand these dimensions as: 1 --> the dataloader batch_size, and B --> the sequence length (the predefined sample_batch_size)?
Similarly, the camera parameters right out of the dataloader should have corresponding shapes, for example:
- T --> [1, B, 3, 1]
- R --> [1, B, 3, 3]
- focal_length --> [1, B, 2, 2]
- principal_point --> [1, B, 2, 2]
I've made my own dataset and dataloader to output data in the above form, but every time the code needs to compute the inverse transform, it ends with an error:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
Have you ever met this error in your experiment?
Usually CUBLAS_STATUS_NOT_INITIALIZED means the program or environment cannot find a CUDA GPU.
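A quick sanity check is to run a tiny matmul on the GPU outside the training code (generic PyTorch, not specific to this repo):

```python
import torch

# CUBLAS_STATUS_NOT_INITIALIZED usually means no visible CUDA device
# (or the GPU had no free memory when cuBLAS allocated its workspace).
print(torch.cuda.is_available())      # expect True
print(torch.cuda.device_count())      # expect >= 1
x = torch.randn(4, 4, device="cuda")
print(x @ x)                          # a small matmul exercises cuBLAS directly
```

If this fails, the problem is the environment (driver / CUDA visibility / GPU memory) rather than your dataloader shapes.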