
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

kadirnar opened this issue 1 year ago • 2 comments

Describe the bug

  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py", line 1422, in encode_image
    x0 = self.vae.encode(image).latent_dist.mode()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 260, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/vae.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
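For context, this error is not specific to LEdits++: it occurs whenever a float16 tensor is fed to a module whose parameters are float32. A minimal sketch, independent of diffusers, using a toy conv in place of the VAE's `conv_in`:

```python
import torch

# Toy conv standing in for the VAE encoder's first layer; weight and bias
# default to float32, just like a VAE that has been upcast.
conv = torch.nn.Conv2d(3, 8, kernel_size=3)

# A half-precision "image" tensor, as produced by an fp16 pipeline.
half_input = torch.randn(1, 3, 16, 16, dtype=torch.float16)

try:
    conv(half_input)
except RuntimeError as err:
    # Input/weight dtype mismatch, analogous to the traceback above.
    print(f"RuntimeError: {err}")
```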

Reproduction

import torch
import PIL
import requests
import io
from diffusers import LEditsPPPipelineStableDiffusionXL

pipe = LEditsPPPipelineStableDiffusionXL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(io.BytesIO(response.content)).convert("RGB")

img_url = "https://www.aiml.informatik.tu-darmstadt.de/people/mbrack/tennis.jpg"
image = download_image(img_url)

_ = pipe.invert(image=image, num_inversion_steps=50, skip=0.2)
edited_image = pipe(
    editing_prompt=["tennis ball", "tomato"],
    reverse_editing_direction=[True, False],
    edit_guidance_scale=[5.0, 10.0],
    edit_threshold=[0.9, 0.85],
).images[0]

Logs

No response

System Info

- `diffusers` version: 0.27.2
- Platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.4.0.dev20240428+cu121 (True)
- Huggingface_hub version: 0.22.2
- Transformers version: 4.36.2
- Accelerate version: 0.26.1
- xFormers version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@linoytsaban @yiyixuxu @sayakpaul @DN6

kadirnar avatar May 03 '24 12:05 kadirnar

Hi Kadir, thanks for bringing up this issue! A similar problem was mentioned in a comment on the implementation PR after it had been merged: https://github.com/huggingface/diffusers/pull/6074#issuecomment-1995334003. Tagging @manuelbrack here as well for visibility. To ensure dtype alignment between the image and the VAE in such situations, StableDiffusionXLReferencePipeline performs an explicit dtype alignment before the VAE's encode: https://github.com/huggingface/diffusers/blob/58237364b1780223f48a80256f56408efe7b59a0/examples/community/stable_diffusion_xl_reference.py#L156-L160

tolgacangoz avatar May 03 '24 20:05 tolgacangoz
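The dtype-alignment idea referenced above can be sketched outside the pipeline. This is only an illustration with a toy conv standing in for the VAE encoder, not the actual diffusers code; presumably a fix would apply a similar cast inside the pipeline's encode_image before calling `self.vae.encode`:

```python
import torch

# Toy stand-in for the VAE encoder (parameters in float32, e.g. after an upcast).
vae_conv = torch.nn.Conv2d(3, 4, kernel_size=3)

# A half-precision "image" tensor, as produced by an fp16 pipeline.
image = torch.randn(1, 3, 8, 8, dtype=torch.float16)

# Align the input's dtype with the module's parameters before encoding;
# this avoids the "Input type ... and bias type ... should be the same" error.
image = image.to(dtype=vae_conv.weight.dtype)
latents = vae_conv(image)
assert latents.dtype == torch.float32
```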

@standardAI would you like to open a PR to fix this?

yiyixuxu avatar May 03 '24 21:05 yiyixuxu