can it do super resolution?
Can VAR do super resolution, like GigaGAN's super resolution, for example? GigaGAN is the most impressive super-resolution algorithm to date. And if yes, would you be able to add support for it later, say next month or so?
VAR supports zero-shot super resolution. Although it might not rival the GigaGAN upsampler, we're planning to release a demo for testing in the coming days. Stay tuned for updates!
Hi keyu,
I'm Bingyue's friend and I'm very impressed with this work!
I have a question regarding large images with super-high resolution.
First, let me try to understand the fundamental logic. Correct me if I'm wrong.
- The basic idea is to establish a self-supervised learning mechanism. In VAR, the process is: raw img -> embedding f -> forward: (r_K -> ... -> r_1) -> backward: (r_1 -> ... -> r_K) -> recovered embedding f^ -> reconstructed img, i.e., from fine to coarse and then inversely from coarse to fine.
- The learning is based on a probabilistic generative model with the conditional generation probabilities P(r_k | r_(k-1), ..., r_1) for k = 1, ..., K, with r_0 as a pre-defined start (i.e., the guidance). (A sketch of this decomposition follows below.)
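To make the decomposition above concrete, here is a minimal, self-contained sketch of one way the embedding f could be decomposed into token maps r_1, ..., r_K (coarsest to finest) by residual quantization. The codebook, scale schedule, and function names are toy placeholders of my own, not VAR's actual implementation:

```python
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    """Nearest-neighbour lookup: map each C-dim vector in z (B, C, h, w)
    to the index of its closest entry in codebook (V, C)."""
    B, C, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)      # (B*h*w, C)
    idx = torch.cdist(flat, codebook).argmin(dim=1)  # (B*h*w,)
    return idx.view(B, h, w)

def dequantize(idx, codebook):
    """Inverse lookup: token indices (B, h, w) -> feature map (B, C, h, w)."""
    return codebook[idx].permute(0, 3, 1, 2)

def multi_scale_quantize(f, codebook, scales):
    """Decompose a feature map f (B, C, H, W) into token maps r_1..r_K
    (coarsest to finest) by quantizing the running residual at each scale."""
    residual, token_maps = f, []
    for (h, w) in scales:                            # e.g. [(1,1), (2,2), ...]
        z_k = F.interpolate(residual, size=(h, w), mode='area')
        r_k = quantize(z_k, codebook)
        token_maps.append(r_k)
        # subtract this scale's decoded, upsampled contribution, so finer
        # scales only need to explain what is still missing
        z_hat = F.interpolate(dequantize(r_k, codebook),
                              size=f.shape[-2:], mode='bicubic')
        residual = residual - z_hat
    return token_maps

# Toy usage: a 16x16 feature map decomposed over 4 scales.
f = torch.randn(1, 32, 16, 16)
codebook = torch.randn(512, 32)
r_maps = multi_scale_quantize(f, codebook, [(1, 1), (2, 2), (4, 4), (16, 16)])
print([tuple(r.shape) for r in r_maps])  # [(1, 1, 1), (1, 2, 2), (1, 4, 4), (1, 16, 16)]
```

The transformer would then model P(r_k | r_(k-1), ..., r_1) over exactly these token maps, predicting each next-scale map from all coarser ones.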
Based on this understanding, for a large image with super-high resolution, we could set the dimension of the embedding vector f higher for more representational capacity.
Considering mainstream techniques like the one in the paper "Scalable Diffusion Models with Transformers", one technique is to "patchify" the raw image into patches (i.e., tokens) and then find the "best" embedding of each patch via transformer-based learning. When each token embedding is decoded back into a "predicted" patch, all the "predicted" patches can be reassembled to recover the whole image.
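For reference, here is a small sketch of that patchify/unpatchify round trip in the spirit of DiT; the tensor-layout conventions are my own choice, not taken from the paper:

```python
import torch

def patchify(img, p):
    """Split img (B, C, H, W) into a token sequence (B, N, p*p*C),
    where N = (H/p) * (W/p)."""
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)   # (B, H/p, W/p, p, p, C)
    return x.reshape(B, (H // p) * (W // p), p * p * C)

def unpatchify(tokens, p, H, W, C):
    """Inverse: reassemble the (possibly model-predicted) patch tokens
    back into a full image (B, C, H, W)."""
    B = tokens.shape[0]
    x = tokens.reshape(B, H // p, W // p, p, p, C)
    x = x.permute(0, 5, 1, 3, 2, 4)   # (B, C, H/p, p, W/p, p)
    return x.reshape(B, C, H, W)

img = torch.randn(2, 3, 64, 64)
tokens = patchify(img, p=8)           # (2, 64, 192)
assert torch.allclose(unpatchify(tokens, 8, 64, 64, 3), img)
```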
Now, the QUESTION is: can we also "patchify" the image, apply the fine -> coarse -> fine process to each patch, and then reassemble the "predicted" patches to recover the whole image?
I'm not quite sure which of the two methods is better. I mean a) setting the dimension of the embedding vector f higher for more representational capacity, or b) patchifying the raw image into patches, working on each patch, and then piecing the "predicted" patches together.
One concern about the "patchify" in method b) is that there could be some un-smooth piecing-together when the computation for the optimization process is not yet sufficient. Note that breaking a whole image into pieces actually destroys the spatial-connection information between the pieces. Method a) does not need to deal with the piecing-together problem because the embedding covers the whole image.
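As a toy illustration of this seam concern, the snippet below applies an arbitrary per-patch operation (a stand-in for a real per-patch model, chosen by me purely for illustration) and compares the discontinuity across a patch border with the discontinuity inside a patch:

```python
import torch

def process_per_patch(img, p, op):
    """Apply `op` to each p x p patch of img (B, C, H, W) independently,
    then stitch the results back together (no overlap, no blending)."""
    B, C, H, W = img.shape
    out = img.clone()
    for i in range(0, H, p):
        for j in range(0, W, p):
            out[:, :, i:i + p, j:j + p] = op(img[:, :, i:i + p, j:j + p])
    return out

# Stand-in "model": normalize each patch by its own mean/std.
op = lambda x: (x - x.mean()) / (x.std() + 1e-6)

img = torch.randn(1, 1, 64, 64).cumsum(-1).cumsum(-2)  # smooth-ish toy image
out = process_per_patch(img, 16, op)

# Adjacent columns straddling a patch border (15|16) typically differ far
# more than adjacent columns inside a patch (7|8): a visible seam.
border_jump = (out[..., :, 15] - out[..., :, 16]).abs().mean()
inner_jump = (out[..., :, 7] - out[..., :, 8]).abs().mean()
print(border_jump.item(), inner_jump.item())
```

The jump at the border comes purely from each patch being handled in isolation, which is the spatial-connection loss described above.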
Best,
Xugang Ye
@judywxy1122 Thank you for your kind words! The question is a bit detailed; let me give it some thought and I'll get back to you shortly.