StableDiffusionUpscalePipeline cannot handle more than 1 image output
### Describe the bug
Calling `StableDiffusionUpscalePipeline` to produce more than one output image crashes. This happens both when passing the `num_images_per_prompt` parameter and when providing multi-element lists for the `prompt` and `image` inputs. It looks like the logic that prepares the image and latents does not account for the effective batch size.
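For reference, the image preparation would need to expand the conditioning image the same way the latents and text embeddings are expanded. A minimal shape-only sketch of that pattern, assuming the `repeat_interleave` + classifier-free-guidance duplication used elsewhere in diffusers (tensor shapes here are illustrative, not taken from the pipeline):

```python
import torch

num_images_per_prompt = 2
image = torch.randn(1, 3, 128, 128)  # one input image (batch size 1)

# Expand the image batch to match the number of requested outputs...
image = image.repeat_interleave(num_images_per_prompt, dim=0)
# ...and duplicate it again for the two halves of classifier-free guidance,
# so it lines up with the duplicated latents and text embeddings.
image = torch.cat([image] * 2)

print(image.shape)  # torch.Size([4, 3, 128, 128])
```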
### Reproduction
Using the example from the Hugging Face website, with one modification:
```python
import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch

# load model and scheduler
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img, num_images_per_prompt=2).images[0]  # <-- modified to 2 outputs
upscaled_image.save("upsampled_cat.png")
```
### Logs
Two alternatives for `upscaled_image = pipeline(prompt=prompt, image=low_res_img)` fail in different ways:
First option:
`upscaled_image = pipeline(prompt=prompt, image=low_res_img, num_images_per_prompt=2)`
Produces:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion_upscale.py", line 520, in __call__
    latent_model_input = torch.cat([latent_model_input, image], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 2 for tensor number 1 in the list.
```
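The mismatch is reproducible in isolation: `torch.cat` along `dim=1` requires all other dimensions to agree, but the latents have been duplicated for classifier-free guidance while the conditioning image has not. The shapes below are illustrative, not copied from the pipeline internals:

```python
import torch

latent_model_input = torch.randn(4, 4, 32, 32)  # 2 images x 2 (classifier-free guidance)
image = torch.randn(2, 3, 32, 32)               # conditioning image stuck at batch size 2

try:
    torch.cat([latent_model_input, image], dim=1)
except RuntimeError as e:
    print(e)  # batch dimensions (dim 0) disagree: 4 vs 2
```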
Second option:
`upscaled_image = pipeline(prompt=[prompt]*2, image=[low_res_img]*2)`
Produces:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion_upscale.py", line 524, in __call__
    latent_model_input, t, encoder_hidden_states=text_embeddings, class_labels=noise_level
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\models\unet_2d_condition.py", line 332, in forward
    emb = emb + class_emb
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
```
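This failure also reduces to a plain broadcasting mismatch: the time embedding has batch 4 after the guidance duplication, while the class (noise-level) embedding is still at the input batch size of 2. The embedding width below is illustrative:

```python
import torch

emb = torch.randn(4, 1280)        # duplicated for classifier-free guidance
class_emb = torch.randn(2, 1280)  # noise-level embedding left at input batch size

try:
    emb + class_emb
except RuntimeError as e:
    print(e)  # non-singleton dim 0 sizes 4 and 2 cannot broadcast
```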
### System Info
- `diffusers` version: 0.9.0
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.7.15
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.11.0
- Transformers version: 4.24.0
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no