StableDiffusionUpscalePipeline cannot handle more than 1 image output
### Describe the bug
Calling `StableDiffusionUpscalePipeline` to produce more than one output image crashes. This happens both when passing the `num_images_per_prompt` parameter and when providing multi-element lists for the `prompt` and `image` inputs. It looks like the logic that prepares the image and latents does not account for the effective batch size.
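For reference, the image preparation would need to expand the conditioning image the same way the latents and text embeddings are expanded. A minimal shape-only sketch of that pattern, assuming the `repeat_interleave` + classifier-free-guidance duplication used elsewhere in diffusers (tensor shapes here are illustrative, not taken from the pipeline):

```python
import torch

num_images_per_prompt = 2
image = torch.randn(1, 3, 128, 128)  # one input image (batch size 1)

# Expand the image batch to match the number of requested outputs...
image = image.repeat_interleave(num_images_per_prompt, dim=0)
# ...and duplicate it again for the two halves of classifier-free guidance,
# so it lines up with the duplicated latents and text embeddings.
image = torch.cat([image] * 2)

print(image.shape)  # torch.Size([4, 3, 128, 128])
```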
### Reproduction
Using the example from the Hugging Face website, with one modification:
```python
import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch

# load model and scheduler
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img, num_images_per_prompt=2).images[0]  # <-- modified to 2 outputs
upscaled_image.save("upsampled_cat.png")
```
### Logs
Two alternatives for `upscaled_image = pipeline(prompt=prompt, image=low_res_img)` fail in different ways:
First option:
`upscaled_image = pipeline(prompt=prompt, image=low_res_img, num_images_per_prompt=2)`
Produces:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion_upscale.py", line 520, in __call__
    latent_model_input = torch.cat([latent_model_input, image], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 2 for tensor number 1 in the list.
```
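The mismatch is reproducible in isolation: `torch.cat` along `dim=1` requires all other dimensions to agree, but the latents have been duplicated for classifier-free guidance while the conditioning image has not. The shapes below are illustrative, not copied from the pipeline internals:

```python
import torch

latent_model_input = torch.randn(4, 4, 32, 32)  # 2 images x 2 (classifier-free guidance)
image = torch.randn(2, 3, 32, 32)               # conditioning image stuck at batch size 2

try:
    torch.cat([latent_model_input, image], dim=1)
except RuntimeError as e:
    print(e)  # batch dimensions (dim 0) disagree: 4 vs 2
```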
Second option:
`upscaled_image = pipeline(prompt=[prompt]*2, image=[low_res_img]*2)`
Produces:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion_upscale.py", line 524, in __call__
    latent_model_input, t, encoder_hidden_states=text_embeddings, class_labels=noise_level
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\<usr>\.conda\envs\test\lib\site-packages\diffusers\models\unet_2d_condition.py", line 332, in forward
    emb = emb + class_emb
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
```
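This failure also reduces to a plain broadcasting mismatch: the time embedding has batch 4 after the guidance duplication, while the class (noise-level) embedding is still at the input batch size of 2. The embedding width below is illustrative:

```python
import torch

emb = torch.randn(4, 1280)        # duplicated for classifier-free guidance
class_emb = torch.randn(2, 1280)  # noise-level embedding left at input batch size

try:
    emb + class_emb
except RuntimeError as e:
    print(e)  # non-singleton dim 0 sizes 4 and 2 cannot broadcast
```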
### System Info
- `diffusers` version: 0.9.0
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.7.15
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.11.0
- Transformers version: 4.24.0
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no