
Tensor size mismatch for non pow of 2 sized image, SD3ControlNetModel

Open Teriks opened this issue 1 year ago • 11 comments

Describe the bug

There seems to be an issue with certain non-power-of-2-sized ControlNet guidance images when using SD3ControlNetModel.

Reproduction

import diffusers
import PIL.Image
import os

import torch

os.environ['HF_TOKEN'] = 'your token'

cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')

pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
     controlnet=cn)

pipe.enable_sequential_cpu_offload()

# aligned by 8, not a power of 2
output_size = (1376, 920)

not_pow_2 = PIL.Image.new('RGB', output_size)

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}

pipe(**args)

Logs

REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1584: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
  0%|          | 0/30 [00:49<?, ?it/s]
Traceback (most recent call last):
  File "REDACT\test.py", line 37, in <module>
    pipe(**args)
  File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\pipelines\controlnet_sd3\pipeline_stable_diffusion_3_controlnet.py", line 1020, in __call__
    latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\schedulers\scheduling_flow_match_euler_discrete.py", line 268, in step
    denoised = sample - model_output * sigma
               ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (115) must match the size of tensor b (114) at non-singleton dimension 2

System Info

Platform: Windows

Python 3.12.3, diffusers 0.29.1, transformers 4.41.2, accelerate 0.31.0

Who can help?

@sayakpaul @yiyixuxu @DN6

Teriks avatar Jun 23 '24 19:06 Teriks

Does the regular SD3 pipeline work with this resolution?

Cc: @ResearcherXman

sayakpaul avatar Jun 24 '24 01:06 sayakpaul

Now that you mention it, it seems that it does not; possibly related to https://github.com/huggingface/diffusers/issues/8668

In that issue, input images that are not aligned by 16 for the img2img pipeline cause this issue to crop up in the scheduler.

Power-of-2 images seem to work consistently when a ControlNet is involved; for the img2img and txt2img pipelines, resolutions aligned to 16 appear to work consistently.

import diffusers
import os

os.environ['HF_TOKEN'] = 'your token'


pipe = diffusers.StableDiffusion3Pipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers')

pipe.enable_sequential_cpu_offload()

# aligned by 8, not a power of 2
output_size = (1376, 920)


args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'prompt': 'test prompt'
}

pipe(**args)

Result:

  0%|          | 0/30 [00:00<?, ?it/s]REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1135: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = hidden_states = F.scaled_dot_product_attention(
  0%|          | 0/30 [00:46<?, ?it/s]
Traceback (most recent call last):
  File "REDACT\test.py", line 24, in <module>
    pipe(**args)
  File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\pipelines\stable_diffusion_3\pipeline_stable_diffusion_3.py", line 862, in __call__
    latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\schedulers\scheduling_flow_match_euler_discrete.py", line 268, in step
    denoised = sample - model_output * sigma
               ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (115) must match the size of tensor b (114) at non-singleton dimension 2

If you specify the nearest value aligned to 16, it works:

output_size = (1376-(1376%16), 920-(920%16))
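
As a general helper, something like the following works (the align_down name is mine, not from diffusers; just a sketch of snapping a size down to the nearest multiple of 16):

# Hypothetical helper, not part of diffusers: snap a dimension down to a
# multiple of 16, which is the alignment that appears to avoid the mismatch.
def align_down(value, multiple=16):
    return value - (value % multiple)

output_size = (align_down(1376), align_down(920))  # (1376, 912)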

Teriks avatar Jun 24 '24 02:06 Teriks

I always thought this was kind of universal knowledge, but you can't just use any resolution. I don't remember the exact requirements for all the models, but mostly you need to use resolutions that are a multiple of 8 for most SD models.

Also in the official ComfyUI workflow for SD3, they put this note:

[image: note from the official ComfyUI SD3 workflow]

So if there's no information about this in the docs, we probably need to add it.

cc: @stevhliu

asomoza avatar Jun 24 '24 06:06 asomoza

The pipeline itself throws an error telling you to use multiple-of-8 dimensions if you do not; here it is failing on dimensions that are multiples of 8. Perhaps it needs some other alignment.

There is an inconsistency between the division that determines the size of the prepared latents here: https://github.com/huggingface/diffusers/blob/f1f542bdd4412a97ac109dab22959a8a575f3a7e/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L588

and the division that determines the size of the tensor returned by the transformer, starting here:

https://github.com/huggingface/diffusers/blob/f1f542bdd4412a97ac109dab22959a8a575f3a7e/src/diffusers/models/transformers/transformer_sd3.py#L378

for this value (1376, 920), which is aligned by eight.

Height: (int(height) // self.vae_scale_factor), or rather int(920) // 8 = 115, is divided with truncation by the patch size 2, resulting in 115 // 2 = 57.

Information is lost in the integer division, i.e. the .5 is lost, since 115.0 / 2.0 would be 57.5.

In the multiplication height * patch_size, which creates the dimension of the transformer output, the result is then 114, a mismatched dimension with respect to the latents tensor:

        # unpatchify
        patch_size = self.config.patch_size
        height = height // patch_size
        width = width // patch_size
  
        hidden_states = hidden_states.reshape(
            shape=(hidden_states.shape[0], height, width, patch_size, patch_size, self.out_channels)
        )
        hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states)
        output = hidden_states.reshape(
            shape=(hidden_states.shape[0], self.out_channels, height * patch_size, width * patch_size)
        )

This is what causes these two tensors to be mismatched in dimension.
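
Here is a minimal arithmetic sketch of the mismatch for this resolution (values taken from the analysis above; a vae_scale_factor of 8 and a patch_size of 2 are assumed):

vae_scale_factor = 8
patch_size = 2

height_px = 920
latent_height = height_px // vae_scale_factor     # 115, height of the prepared latents
patched_height = latent_height // patch_size      # 115 // 2 = 57, the .5 is truncated
transformer_height = patched_height * patch_size  # 57 * 2 = 114, height of the transformer output

print(latent_height, transformer_height)  # 115 vs 114 -> RuntimeError in scheduler.step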

From this it seems that the proper required alignment would be 16, or possibly 64, though it is neither documented nor enforced by the pipeline via an exception.

It seems this is definitely not related to a power-of-2 requirement in the basic use case.

For instance, alignment by 16 with a ControlNet works, contrary to what I mentioned before; I probably mis-keyed a number.

The resulting height of 912 is not aligned by 64, and it works, though it could be a fluke.

import diffusers
import PIL.Image
import os

os.environ['HF_TOKEN'] = 'your token'

cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')

pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)

pipe.enable_sequential_cpu_offload()

width = 1376
height = 920

# aligned by 16
output_size = (width-(width % 16), height-(height % 16))

not_pow_2 = PIL.Image.new('RGB', output_size)

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}

pipe(**args)

Teriks avatar Jun 24 '24 07:06 Teriks

In addition, it seems that alignments of 16 or even 64 cause issues with VAE tiling, while power-of-2-sized images succeed, which is what I may have mixed up in this issue. This occurs with and without a ControlNet.

I have closed the other issue, as it is essentially the same problem.

import diffusers
import PIL.Image
import os

os.environ['HF_TOKEN'] = 'your token'

cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')

pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)

pipe.enable_sequential_cpu_offload()

pipe.vae.enable_tiling()

width = 1376
height = 920

# aligned by 16, but alignment by 64 also fails
output_size = (width-(width % 16), height-(height % 16))

not_pow_2 = PIL.Image.new('RGB', output_size)

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}

pipe(**args)

Fail:

REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1584: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
Traceback (most recent call last):
  File "REDACT\test.py", line 35, in <module>
    pipe(**args)
  File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\pipelines\controlnet_sd3\pipeline_stable_diffusion_3_controlnet.py", line 912, in __call__
    control_image = self.vae.encode(control_image).latent_dist.sample()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 258, in encode
    return self.tiled_encode(x, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 363, in tiled_encode
    tile = self.quant_conv(tile)
           ^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

This code succeeds:

import diffusers
import PIL.Image
import os

os.environ['HF_TOKEN'] = 'your token'

cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')

pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)

pipe.enable_sequential_cpu_offload()

pipe.vae.enable_tiling()

width = 1024
height = 1024

output_size = (width, height)

pow_2 = PIL.Image.new('RGB', output_size)

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': pow_2,
    'prompt': 'test prompt'
}

pipe(**args)

Teriks avatar Jun 24 '24 08:06 Teriks

  • Edited these due to a late-night mix-up between WxH and HxW

Teriks avatar Jun 24 '24 19:06 Teriks

The normal pipeline presents a different error with VAE tiling enabled, after denoising completes.

It takes quite a while (for me) to run these at 30 inference steps, so I have reduced it to 3 here.

This is the source of my original confusion about a possible power-of-2 requirement, which I had not accounted for in my code (VAE tiling being on); otherwise the pipelines seem to work with an alignment of 16. Just not with VAE tiling, which currently only works with power-of-2 sizes.

import diffusers
import os

os.environ['HF_TOKEN'] = 'your token'

pipe = diffusers.StableDiffusion3Pipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers')

pipe.enable_sequential_cpu_offload()

pipe.vae.enable_tiling()

width = 1376
height = 920

# aligned by 16, but alignment by 64 also fails
output_size = (width-(width % 16), height-(height % 16))

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 3,
    'width': output_size[0],
    'height': output_size[1],
    'prompt': 'test prompt'
}

pipe(**args)

Log:

Will try to load from local cache.
Loading pipeline components...:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:04<00:04,  4.34s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.56s/it]
Loading pipeline components...:  78%|███████▊  | 7/9 [00:11<00:01,  1.06it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 9/9 [00:45<00:00,  5.01s/it]
  0%|          | 0/3 [00:00<?, ?it/s]REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1135: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = hidden_states = F.scaled_dot_product_attention(
100%|██████████| 3/3 [01:16<00:00, 25.65s/it]
Traceback (most recent call last):
  File "REDACT\test.py", line 27, in <module>
    pipe(**args)
  File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\pipelines\stable_diffusion_3\pipeline_stable_diffusion_3.py", line 895, in __call__
    image = self.vae.decode(latents, return_dict=False)[0]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 314, in decode
    decoded = self._decode(z).sample
              ^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 280, in _decode
    return self.tiled_decode(z, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 412, in tiled_decode
    tile = self.post_quant_conv(tile)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Teriks avatar Jun 24 '24 19:06 Teriks

The mentioned issues with VAE tiling are due to vae/config.json having:

"use_post_quant_conv": false,
"use_quant_conv": false

This causes the quant_conv used here:

https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L363

and the post_quant_conv used here:

https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L412

to be None.

Perhaps, at the moment, the model is simply not entirely compatible with the tiling in AutoencoderKL, as the state dict does not possess the keys post_quant_conv.bias, quant_conv.weight, post_quant_conv.weight, quant_conv.bias.

import os
import diffusers

os.environ['HF_TOKEN'] = 'your token'

vae = diffusers.AutoencoderKL.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    subfolder='vae',
    use_post_quant_conv=True,
    use_quant_conv=True)

Fail:

REDACT\venv\Lib\site-packages\diffusers\models\transformers\transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
Traceback (most recent call last):
  File "REDACT\test.py", line 6, in <module>
    vae = diffusers.AutoencoderKL.from_pretrained(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\modeling_utils.py", line 750, in from_pretrained
    raise ValueError(
ValueError: Cannot load <class 'diffusers.models.autoencoders.autoencoder_kl.AutoencoderKL'> from stabilityai/stable-diffusion-3-medium-diffusers because the following keys are missing: 
 post_quant_conv.weight, quant_conv.weight, post_quant_conv.bias, quant_conv.bias. 
 Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomly initialize those weights or else make sure your checkpoint file is correct.
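
For reference, here is a quick check (a sketch assuming the same repo id as above; the printed values follow from the config quoted earlier) showing that the tiling code paths call modules that simply are not there:

import os
import diffusers

os.environ['HF_TOKEN'] = 'your token'

vae = diffusers.AutoencoderKL.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    subfolder='vae')

print(vae.config.use_quant_conv)       # False
print(vae.config.use_post_quant_conv)  # False
print(vae.quant_conv)                  # None -> tiled_encode calls this and fails
print(vae.post_quant_conv)             # None -> tiled_decode calls this and fails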

Teriks avatar Jun 24 '24 19:06 Teriks

How can this problem be solved?

ucasyjz avatar Jul 02 '24 03:07 ucasyjz

@sayakpaul, I am not sure if SD3 is intended to support VAE tiling?

Teriks avatar Jul 02 '24 14:07 Teriks

Please open a new issue for that.

sayakpaul avatar Jul 02 '24 14:07 sayakpaul

https://github.com/huggingface/diffusers/issues/8788

Closing this

Teriks avatar Jul 04 '24 03:07 Teriks