During fine-tuning for depth estimation conditioned on an input image, how do you deal with the text prompt required by the original pre-trained text-to-image Stable Diffusion model?
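One common workaround in image-conditioned diffusion fine-tuning (a general pattern, not necessarily what this specific work does) is to condition on the empty prompt: encode `""` once with the frozen CLIP text encoder and reuse that null embedding for every training sample. A minimal sketch with Hugging Face `transformers`; the checkpoint id is an assumption:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Model id is an assumption; substitute whichever SD checkpoint you fine-tune.
model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Encode the empty prompt once; this "null" embedding stands in for
# per-image captions in every training batch.
tokens = tokenizer(
    "", padding="max_length",
    max_length=tokenizer.model_max_length, return_tensors="pt",
)
with torch.no_grad():
    null_text_embedding = text_encoder(tokens.input_ids)[0]  # (1, 77, 768)

# During fine-tuning, pass null_text_embedding.expand(batch_size, -1, -1)
# to the U-Net as encoder_hidden_states.
```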
In this picture, why do IL and IH share the same attention map generated by the Mask extractor when IL and IH are unpaired?
During training, are the reference image and the GT image taken from the same CelebA image?
If the pseudo labels predicted by the teacher model are inaccurate, how does the student model obtain correct information from the unlabeled data? Why does the student model outperform...
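One commonly cited mechanism (FixMatch-style confidence thresholding, offered here as a general explanation rather than this paper's exact method) is that the student trains only on pseudo labels the teacher is confident about, so label noise is filtered rather than memorized. A minimal PyTorch sketch; the threshold value is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_logits, teacher_logits, threshold=0.95):
    """FixMatch-style loss: train the student only on unlabeled samples
    where the teacher's prediction is confident enough.

    threshold=0.95 is an assumption, not a value from the paper.
    """
    probs = F.softmax(teacher_logits.detach(), dim=-1)
    confidence, pseudo_labels = probs.max(dim=-1)
    mask = (confidence >= threshold).float()  # drop uncertain pseudo labels
    per_sample = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
```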
Before fine-tuning, the result is black. After fine-tuning, the result is still black. How can this problem be solved?
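All-black Stable Diffusion outputs are often caused by NaN latents under fp16 or by the safety checker replacing flagged images with black; neither cause is confirmed for this case. A hedged diagnostic sketch using `diffusers`:

```python
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

# Two frequent culprits for black outputs (assumptions, not confirmed causes):
# 1) fp16 numerical overflow producing NaN latents -> try float32
# 2) the safety checker blanking flagged images    -> disable it
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # model id is an assumption
    torch_dtype=torch.float32,         # rule out fp16 NaN overflow
    safety_checker=None,               # rule out safety-checker blanking
).to("cuda")

image = pipe("a photo of a cat").images[0]

# Inspect the decoded pixels directly: an all-zero array means the problem
# is upstream (weights or latents), not post-processing.
print("max pixel value:", np.asarray(image).max())
```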