How is the latent image (64x64) used in the tutorial?
I appreciate the author's great work.
My question is about the latent image (compressed from the 512x512 original image). The paper mentions that ControlNet trains a new lite CNN to compress the conditioning image (e.g., a depth map, segmentation map, or edge map) into a latent-resolution image. Is that step crucial when training on a custom dataset? I don't see any information about training this lite CNN in the example shown in the tutorial. Should we retrain a new CNN to obtain the latent image on a new dataset?
This is the model used for encoding new control information into latent space. https://github.com/lllyasviel/ControlNet/blob/main/cldm/cldm.py#L47-L304
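For intuition, here is a minimal PyTorch sketch of that "lite CNN" hint encoder. This is a paraphrase, not the exact repo code: in the repo it is the `input_hint_block` inside the `ControlNet` class (built with `conv_nd`/`zero_module` helpers), and the channel widths below follow the repo's defaults, but treat the details as an approximation. The key point is that a few strided convolutions map the 512x512 control image down to the 64x64 latent resolution:

```python
import torch
import torch.nn as nn

class HintEncoder(nn.Module):
    """Sketch of ControlNet's input_hint_block (not the exact repo code)."""
    def __init__(self, hint_channels=3, model_channels=320):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(hint_channels, 16, 3, padding=1), nn.SiLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),   # 512 -> 256
            nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),   # 256 -> 128
            nn.Conv2d(96, 96, 3, padding=1), nn.SiLU(),
            nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),  # 128 -> 64
            nn.Conv2d(256, model_channels, 3, padding=1),
        )
        # The last conv is zero-initialized (a "zero convolution" in the
        # paper's terms), so the control branch starts as a no-op.
        nn.init.zeros_(self.blocks[-1].weight)
        nn.init.zeros_(self.blocks[-1].bias)

    def forward(self, hint):
        return self.blocks(hint)

enc = HintEncoder()
hint = torch.randn(1, 3, 512, 512)  # e.g. a normalized canny/depth/seg map
print(enc(hint).shape)              # torch.Size([1, 320, 64, 64])
```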
And this is where that control information is consumed in the bigger ControlLDM: https://github.com/lllyasviel/ControlNet/blob/main/cldm/cldm.py#L311 https://github.com/lllyasviel/ControlNet/blob/main/cldm/cldm.py#L333
This lite CNN is trained jointly with the rest of the ControlNet during training; you don't need to pretrain it separately or retrain it in its own stage for a new dataset.
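To make "trained during training" concrete: the hint encoder's weights are just part of `control_model`'s parameters, so the same optimizer that trains the ControlNet copy also trains the lite CNN end-to-end. Below is a hedged paraphrase of what `configure_optimizers` in cldm/cldm.py does; the attribute names follow the repo, but `build_optimizer` itself is a hypothetical wrapper for illustration, so double-check the source:

```python
import torch

def build_optimizer(cldm, lr=1e-5, sd_locked=True):
    """Paraphrase of ControlLDM.configure_optimizers (cldm/cldm.py).

    cldm.control_model.parameters() already includes input_hint_block
    (the lite CNN), so the hint encoder is learned together with the
    trainable ControlNet copy -- no separate pretraining stage.
    """
    params = list(cldm.control_model.parameters())
    if not sd_locked:
        # Optionally also unlock the decoder half of the frozen SD U-Net.
        params += list(cldm.model.diffusion_model.output_blocks.parameters())
        params += list(cldm.model.diffusion_model.out.parameters())
    return torch.optim.AdamW(params, lr=lr)
```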
Thanks @xiankgx for your detailed answer! That's super helpful.