JianHan
My result on 10-shot: AP for bird = 0.525, AP for bus = 0.158, AP for cow = 0.539, AP for motorbike = 0.454, AP for sofa = 0.029, Mean...
@minimini-1 @maggiesong7 Indeed, the way you are trying to reconstruct an image using VAR is incorrect. VAR formulates a next-scale prediction task where **current scale prediction is conditioned on previous...
@Leiii-Cao Powered by a CNN structure, the VAE can encode and decode images of arbitrary resolution. However, VAR only generates square images. Our recent work [Infinity](https://github.com/FoundationVision/Infinity) (text-to-image model for VAR)...
It's OK to slightly change the scale schedule for the VQVAE since it adopts a CNN architecture. It can still encode and decode images normally, but with a slight performance drop....
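
To make "changing the scale schedule" a bit more concrete, here is a minimal sketch that only counts tokens per scale. It assumes a schedule is a tuple of side lengths of square token maps; `tokens_per_scale`, `default_schedule`, and `altered_schedule` are hypothetical names picked for illustration, not taken from the VAR code, and the altered schedule is arbitrary.

```python
# Hypothetical helper for illustration; not part of the VAR codebase.
def tokens_per_scale(patch_nums):
    """Return the token count at each scale, assuming square h x h token maps."""
    return [pn * pn for pn in patch_nums]


# Assumed schedule: side length of the token map at each scale.
default_schedule = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
# A slightly altered schedule, chosen arbitrarily for this sketch.
altered_schedule = (1, 2, 3, 4, 6, 8, 11, 16)

for name, schedule in [("default", default_schedule), ("altered", altered_schedule)]:
    counts = tokens_per_scale(schedule)
    print(name, counts, "total:", sum(counts))
```

Both schedules end at the same final side length, so the decoder sees a map of the same size at the last scale. The sketch says nothing about reconstruction quality, which would have to be measured.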