
Is it good to train on 512x512, inference on 512x768?

Open tengshaofeng opened this issue 3 years ago • 7 comments

Thank you for your great work. It is really useful to me. I want to run inference at a resolution of 512x768. I know I could do that with a model trained on 512x512, but to get the best performance on both 512x512 and 512x768, should I train on 512x512, 512x768, or 768x768? I'd appreciate your advice.

tengshaofeng avatar Feb 20 '23 07:02 tengshaofeng

In our experiments, if there is a resolution discrepancy, then the performance usually degrades quite a bit i.e., if you fine-tune on a custom dataset with a different resolution, then the quality of the generated images might not be on par. Usually, the number of training images seen during fine-tuning dictates this performance.

If an upscaled resolution is a requirement for you, would you mind trying out the latent upscaler model we recently introduced? You can find an application of it in this Space: https://huggingface.co/spaces/huggingface-projects/stable-diffusion-latent-upscaler/blob/main/app.py.

Cc: @yiyixuxu

sayakpaul avatar Feb 20 '23 07:02 sayakpaul

@sayakpaul, thanks so much for your reply. So I want to fine-tune sd-1.5 with 200k images, and most of the images are at a resolution of 512x768. Based on what you said, to get the best performance on 512x768, maybe training on 512x768 is better? BTW, thanks for your latent upscaler model, I will try it later.

tengshaofeng avatar Feb 20 '23 08:02 tengshaofeng

Yes, sure, it's worth giving a try. I just wanted to share our experience so you're aware of the poor results that might arise :)

sayakpaul avatar Feb 20 '23 08:02 sayakpaul

@sayakpaul I tried the upscaler app, and it is not good yet, maybe worse than Real-ESRGAN. Is the result I'm showing correct? prompt="(portrait:1.0),face in the center, Mage godess with white hair and mage god with black hair, pale skin, fantasy, in love, couple, hug each other, sharp focus, intricate, elegant, illustration, ambient lighting, art by stefanie law, qistina khalidah, tranding on artstation, art by luis royo higly detailed studio lighting"

tengshaofeng avatar Feb 20 '23 08:02 tengshaofeng

It's probably happening because of the discrepancy with the training data. So, I guess your best bet for now is to fine-tune the model or use something like MultiDiffusion. Cc: @omerbt

sayakpaul avatar Feb 20 '23 08:02 sayakpaul

@sayakpaul Thanks so much.

tengshaofeng avatar Feb 20 '23 08:02 tengshaofeng

Hi! Indeed, as @sayakpaul mentioned, even though Stable Diffusion can technically process higher-resolution images, we observed that it often produces poor-quality outputs (they are out-of-distribution w.r.t. its training data). MultiDiffusion tackles this and allows generating high-quality images at arbitrary aspect ratios. See this documentation for how to use it through diffusers.

omerbt avatar Feb 20 '23 18:02 omerbt

@tengshaofeng,

You can also definitely try directly generating larger images by setting height and width. This works quite well sometimes. See: https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.call.height

patrickvonplaten avatar Mar 06 '23 11:03 patrickvonplaten

@omerbt @patrickvonplaten thanks for your replies, guys. I learned so much. Thanks again.

tengshaofeng avatar Mar 14 '23 11:03 tengshaofeng

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 07 '23 15:04 github-actions[bot]