
[BUG] converting T5 text encoder to float16 results in corrupted images

Open Luciennnnnnn opened this issue 1 year ago • 13 comments

Describe the bug

I have tested PixArt-Sigma with the following code, where I load the text_encoder separately since I will fine-tune it later. I found that T5EncoderModel.from_pretrained(torch_dtype=torch.float16) behaves very differently from T5EncoderModel.from_pretrained().to(dtype=torch.float16); the latter produces corrupted images.

What happens when we pass the torch_dtype argument to from_pretrained?

Reproduction

from diffusers import PixArtSigmaPipeline
import torch

from transformers import T5EncoderModel


# text_encoder = T5EncoderModel.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="text_encoder", torch_dtype=torch.float16) # good result
text_encoder = T5EncoderModel.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="text_encoder").to(dtype=torch.float16) # noise

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=text_encoder,
    torch_dtype=torch.float16
)

pipe = pipe.to("cuda")

prompts = ["a space elevator, cinematic scifi art"]

for idx, prompt in enumerate(prompts):
    image = pipe(prompt=prompt, num_inference_steps=50, generator=torch.manual_seed(1)).images[0]
    image.save(f"x_{idx}.png")

Logs

No response

System Info

  • 🤗 Diffusers version: 0.29.0
  • Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
  • Running on a notebook?: No
  • Running on Google Colab?: No
  • Python version: 3.10.11
  • PyTorch version (GPU?): 2.1.2+cu118 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.23.3
  • Transformers version: 4.41.2
  • Accelerate version: 0.23.0
  • PEFT version: 0.7.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.2
  • xFormers version: 0.0.23.post1+cu118
  • Accelerator: 8× NVIDIA A100-SXM4-80GB, 81920 MiB VRAM each
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul @yiyixuxu

Luciennnnnnn avatar Jun 17 '24 14:06 Luciennnnnnn

Can you check if the outputs of the text encoders vary when loaded using the method you described?

That will be an easier way to reproduce the problem.

Cc: @lawrence-cj

sayakpaul avatar Jun 17 '24 14:06 sayakpaul

Interesting, I can reproduce this error, these are the outputs:

# text_encoder = T5EncoderModel.from_pretrained(...).to(dtype=torch.float16)
tensor([[[ 0.0872, -0.0144, -0.0733,  ...,  0.0432,  0.0251,  0.1550],
         [ 0.0277, -0.1429, -0.1173,  ...,  0.0565, -0.1959,  0.0936],
         [-0.0569,  0.1390, -0.1050,  ...,  0.0665,  0.0408,  0.1098]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<MulBackward0>)
# text_encoder = T5EncoderModel.from_pretrained(..., torch_dtype=torch.float16)
tensor([[[-1.2744e-01, -1.4755e-02, -6.3416e-02,  ...,  1.0626e-01,
          -3.7567e-02, -1.1975e-01],
         [-1.1462e-01,  6.1569e-03,  1.1475e-01,  ..., -3.8208e-02,
          -1.1078e-01, -1.0980e-01],
         [-5.2605e-03, -7.7438e-03,  3.5763e-06,  ..., -3.6888e-03,
           7.2136e-03,  2.2907e-03]]], device='cuda:0', dtype=torch.float16,
       grad_fn=<MulBackward0>)

I found that with the second one, some layers are still in torch.float32.
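A quick way to audit this is to count how many parameters sit in each dtype. The helper below is generic (you would pass the loaded T5EncoderModel); the demo runs it on a toy module where one layer is deliberately kept in fp32, since the real checkpoint is too heavy for a snippet:

```python
from collections import Counter

import torch
import torch.nn as nn

def dtype_report(model: nn.Module) -> Counter:
    """Count how many parameters sit in each dtype."""
    return Counter(p.dtype for p in model.parameters())

# Toy stand-in; on the real model you would pass the loaded T5EncoderModel.
toy = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 4)).to(torch.float16)
toy[1].to(torch.float32)  # simulate a layer that from_pretrained kept in fp32
print(dtype_report(toy))  # both dtypes show up: 2 params each
```

Running this on the two T5 loads should immediately show whether the `torch_dtype=torch.float16` path left some parameters in torch.float32 while the `.to(dtype=torch.float16)` path did not.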

asomoza avatar Jun 17 '24 15:06 asomoza

Ccing a Transformers maintainer here: @ArthurZucker

sayakpaul avatar Jun 17 '24 15:06 sayakpaul

I think for t5, certain layers are upcast to float32 when loading the checkpoint with from_pretrained in fp16: https://github.com/huggingface/transformers/blob/9ba9369a2557e53a01378199a9839ec6e82d8bc7/src/transformers/models/t5/modeling_t5.py#L797
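For anyone who already has a model cast with a blanket `.to(torch.float16)`, the effect can be approximated by casting the matching submodules back to fp32 afterwards. This is only an illustrative sketch (the `"wo"` name comes from the linked modeling_t5.py; the demo uses a toy block, and the robust fix remains loading with `torch_dtype=torch.float16`):

```python
import torch
import torch.nn as nn

def keep_modules_in_fp32(model: nn.Module, keywords=("wo",)) -> nn.Module:
    """After a blanket .to(torch.float16), cast matching submodules back to fp32.
    Sketches what from_pretrained does via T5's keep-in-fp32 module list."""
    for name, module in model.named_modules():
        if name.split(".")[-1] in keywords:
            module.to(torch.float32)
    return model

# Toy stand-in for a T5 block, with submodule names mirroring DenseReluDense.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.wi = nn.Linear(8, 32)
        self.wo = nn.Linear(32, 8)

m = nn.Sequential(Block(), Block()).to(torch.float16)
keep_modules_in_fp32(m)
print(m[0].wo.weight.dtype, m[0].wi.weight.dtype)  # torch.float32 torch.float16
```

Note that re-upcasting after the cast recovers the fp32 storage but not activations already computed, so it is a workaround for loading, not a general fix.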

yiyixuxu avatar Jun 18 '24 00:06 yiyixuxu

If certain layers need to be upcasted to float32, is the training code of SD3 correct? In the training code of SD3, the T5 text encoder is initially loaded in float32 and then converted to float16 using the to() method when employing mixed precision training with fp16. We do not appear to encounter similar issues when loading the T5 text encoder in SD3. Could this be due to differences between the T5 encoder utilized in PixArt-Sigma and the one in SD3?

Luciennnnnnn avatar Jun 18 '24 01:06 Luciennnnnnn

Training does not seem to be affected by this :/

sayakpaul avatar Jun 18 '24 06:06 sayakpaul

Training does not seem to be affected by this :/

Why? If some parameters of T5 have to be in float32, casting them down will cause the flow transformer to get inferior text features.

Luciennnnnnn avatar Jun 18 '24 07:06 Luciennnnnnn

Could very well be but the qualitative samples haven’t told me that yet.

This needs a deeper investigation. But the problem could stem from the fact that the original checkpoints are in float16 and I am not exactly sure about the consequences of any kind of casting here yet.

sayakpaul avatar Jun 18 '24 07:06 sayakpaul

@Luciennnnnnn, can you run the same experiment for sd3 to see if it also produces a worse image in fp16? https://github.com/huggingface/diffusers/issues/8604#issue-2357443101

t5 embeddings are used differently in sd3 and pixart, so it is possible this has less or no effect in sd3. But we were not aware before that these layers in t5 need to be in fp32, so it's not impossible that training could work better for sd3 if we do that.

yiyixuxu avatar Jun 18 '24 22:06 yiyixuxu

a quick test for sd3 here - fp16 (bottom row) seems ok?

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel

repo = "stabilityai/stable-diffusion-3-medium-diffusers"
dtype = torch.float16

pipe = StableDiffusion3Pipeline.from_pretrained(repo, torch_dtype=dtype)
pipe.enable_model_cpu_offload()
print(pipe.text_encoder_3.encoder.block[11].layer[1].DenseReluDense.wo.weight.dtype)
out = []
generator = torch.Generator(device="cpu").manual_seed(0)
for i in range(2):
    image = pipe(
        "A cat holding a sign that says hello world",
        negative_prompt="",
        num_inference_steps=28,
        guidance_scale=7.0,
        generator=generator,
    ).images[0]
    out.append(image)
pipe.text_encoder_3 = pipe.text_encoder_3.to(dtype)
print(pipe.text_encoder_3.encoder.block[11].layer[1].DenseReluDense.wo.weight.dtype)
generator = torch.Generator(device="cpu").manual_seed(0)
for i in range(2):
    image = pipe(
        "A cat holding a sign that says hello world",
        negative_prompt="",
        num_inference_steps=28,
        guidance_scale=7.0,
        generator=generator,
    ).images[0]
    out.append(image)



from diffusers.utils import make_image_grid
make_image_grid(out, rows=2, cols=2).save("yiyi_test_1_out.png")

(attached image: yiyi_test_1_out.png, 2×2 grid of the four generations)

yiyixuxu avatar Jun 19 '24 07:06 yiyixuxu

How do we want to go about this? Should we maybe document this issue to start with? Gently ping @asomoza and @yiyixuxu about this.

sayakpaul avatar Jun 29 '24 13:06 sayakpaul

I also did a test with SD3; it has the same problem, but the CLIP text encoders save the generation.

I also did the same test with the original T5: the same thing happens if I use it for generating embeddings, but when I do inference it works with both methods.

Probably the best solution is to make the embeddings stay the same when we do T5EncoderModel.from_pretrained(...).to(dtype=torch.float16), but that's something that has to be done on the transformers side.

Since that could take more time, I also think we should add to the docs that, for the T5 text encoders, users can't do T5EncoderModel.from_pretrained(...).to(dtype=torch.float16).

asomoza avatar Jun 30 '24 01:06 asomoza

yes, I think we should document this :)

yiyixuxu avatar Jul 01 '24 17:07 yiyixuxu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]
