Tensor size mismatch for non-power-of-2 sized images with SD3ControlNetModel
Describe the bug
There seems to be an issue with certain non-power-of-2 sized controlnet guidance images when using SD3ControlNetModel.
Reproduction
import diffusers
import PIL.Image
import os
import torch
os.environ['HF_TOKEN'] = 'your token'
cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')
pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)
pipe.enable_sequential_cpu_offload()
# aligned by 8, not a power of 2
output_size = (1376, 920)
not_pow_2 = PIL.Image.new('RGB', output_size)
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}
pipe(**args)
Logs
REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1584: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = F.scaled_dot_product_attention(
0%| | 0/30 [00:49<?, ?it/s]
Traceback (most recent call last):
File "REDACT\test.py", line 37, in <module>
pipe(**args)
File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\pipelines\controlnet_sd3\pipeline_stable_diffusion_3_controlnet.py", line 1020, in __call__
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\schedulers\scheduling_flow_match_euler_discrete.py", line 268, in step
denoised = sample - model_output * sigma
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (115) must match the size of tensor b (114) at non-singleton dimension 2
System Info
Platform: Windows
Python 3.12.3, diffusers 0.29.1, transformers 4.41.2, accelerate 0.31.0
Who can help?
@sayakpaul @yiyixuxu @DN6
Does the regular SD3 pipeline work with this resolution?
Cc: @ResearcherXman
Now that you mention it, it seems that it does not; possibly related to https://github.com/huggingface/diffusers/issues/8668
In that issue, img2img input images that are not aligned to 16 cause the same error to crop up in the scheduler.
Power-of-2 images seem to work consistently when a controlnet is involved; for the img2img and txt2img pipelines, resolutions aligned to 16 appear to work consistently.
import diffusers
import os
os.environ['HF_TOKEN'] = 'your token'
pipe = diffusers.StableDiffusion3Pipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers')
pipe.enable_sequential_cpu_offload()
# aligned by 8, not a power of 2
output_size = (1376, 920)
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'prompt': 'test prompt'
}
pipe(**args)
Result:
0%| | 0/30 [00:00<?, ?it/s]REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1135: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = hidden_states = F.scaled_dot_product_attention(
0%| | 0/30 [00:46<?, ?it/s]
Traceback (most recent call last):
File "REDACT\test.py", line 24, in <module>
pipe(**args)
File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\pipelines\stable_diffusion_3\pipeline_stable_diffusion_3.py", line 862, in __call__
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\schedulers\scheduling_flow_match_euler_discrete.py", line 268, in step
denoised = sample - model_output * sigma
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (115) must match the size of tensor b (114) at non-singleton dimension 2
If you specify the nearest value aligned to 16, it works:
output_size = (1376-(1376%16), 920-(920%16))
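The same rounding can be wrapped in a small helper (a minimal sketch; align_down is an illustrative name, not a diffusers function):

# Minimal sketch: round a requested dimension down to the nearest multiple of 16.
# The helper name is illustrative and not part of diffusers.
def align_down(value: int, alignment: int = 16) -> int:
    return value - (value % alignment)

output_size = (align_down(1376), align_down(920))  # (1376, 912)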
I always thought this was more or less common knowledge: you can't just use any resolution. I don't remember all of the requirements, but for most SD models you need to use resolutions that are a multiple of 8.
Also, the official ComfyUI workflow for SD3 includes a note about this.
So if there's no information about this in the docs, it probably needs to be added.
cc: @stevhliu
The pipeline itself throws an error telling you to use multiple-of-8 dimensions if you do not, but it is failing here on dimensions that are multiples of 8. Perhaps it needs some other alignment.
There is an inconsistency between the division that determines the size of the prepared latents here: https://github.com/huggingface/diffusers/blob/f1f542bdd4412a97ac109dab22959a8a575f3a7e/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L588
and the division that determines the size of the tensor returned by the transformer, starting here:
https://github.com/huggingface/diffusers/blob/f1f542bdd4412a97ac109dab22959a8a575f3a7e/src/diffusers/models/transformers/transformer_sd3.py#L378
For the value (1376, 920), which is aligned to 8:
The latent height, (int(height) // self.vae_scale_factor), i.e. int(920) // 8 = 115, is then integer-divided (truncated) by the patch size 2, giving 115 // 2 = 57.
Information is lost in the integer division, i.e. the .5 is lost, since 115.0 / 2.0 would be 57.5.
In the multiplication height * patch_size, which produces the dimension of the transformer output, the result is then 57 * 2 = 114, a dimension that does not match the latents tensor.
# unpatchify
patch_size = self.config.patch_size
height = height // patch_size
width = width // patch_size
hidden_states = hidden_states.reshape(
    shape=(hidden_states.shape[0], height, width, patch_size, patch_size, self.out_channels)
)
hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states)
output = hidden_states.reshape(
    shape=(hidden_states.shape[0], self.out_channels, height * patch_size, width * patch_size)
)
This is what causes these two tensors to be mismatched in dimension.
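The mismatch can be reproduced with plain arithmetic (a minimal sketch; the variable names are illustrative):

# Minimal sketch of the size mismatch for height=920; variable names are illustrative.
vae_scale_factor = 8
patch_size = 2
latent_height = 920 // vae_scale_factor            # 115, height of the prepared latents
patched_height = latent_height // patch_size       # 115 // 2 = 57, the .5 is truncated
unpatchified_height = patched_height * patch_size  # 57 * 2 = 114, height of the transformer output
print(latent_height, unpatchified_height)          # 115 114 -> mismatch at dimension 2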
From this it seems that the proper required alignment would be 16 (or possibly 64), though it is not documented or enforced by the pipeline via an exception.
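Until the requirement is documented or enforced, a caller-side guard could look something like this (a sketch only; the 16-pixel value is an assumption following from vae_scale_factor * patch_size = 8 * 2 in the analysis above, not an official requirement):

# Sketch of a caller-side check; the alignment value is an assumption based on the
# analysis above (vae_scale_factor * patch_size = 8 * 2), not a documented requirement.
def check_alignment(width: int, height: int, alignment: int = 16) -> None:
    if width % alignment or height % alignment:
        raise ValueError(
            f'width and height must be multiples of {alignment}, got {width}x{height}')

check_alignment(1376, 912)  # passes
check_alignment(1376, 920)  # raises ValueError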
It seems definitely not related to power of 2 in the basic use case.
For instance, alignment by 16 with a controlnet works, contrary to what I mentioned before; I probably mistyped a number.
The resulting height of 912 is not aligned to 64, and it works, though it could be a fluke.
import diffusers
import PIL.Image
import os
os.environ['HF_TOKEN'] = 'your token'
cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')
pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)
pipe.enable_sequential_cpu_offload()
width = 1376
height = 920
# aligned by 16
output_size = (width-(width % 16), height-(height % 16))
not_pow_2 = PIL.Image.new('RGB', output_size)
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}
pipe(**args)
In addition, it seems that alignments of 16 or even 64 cause issues with VAE tiling; power-of-2 sized images, however, succeed, which is what I may have mixed up in this issue. This seems to occur with and without a controlnet.
I have closed the other issue as it is basically the same issue.
import diffusers
import PIL.Image
import os
os.environ['HF_TOKEN'] = 'your token'
cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')
pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
width = 1376
height = 920
# aligned by 16, but alignment by 64 also fails
output_size = (width-(width % 16), height-(height % 16))
not_pow_2 = PIL.Image.new('RGB', output_size)
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}
pipe(**args)
Fail:
REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1584: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = F.scaled_dot_product_attention(
Traceback (most recent call last):
File "REDACT\test.py", line 35, in <module>
pipe(**args)
File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\pipelines\controlnet_sd3\pipeline_stable_diffusion_3_controlnet.py", line 912, in __call__
control_image = self.vae.encode(control_image).latent_dist.sample()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 258, in encode
return self.tiled_encode(x, return_dict=return_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 363, in tiled_encode
tile = self.quant_conv(tile)
^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
This code succeeds:
import diffusers
import PIL.Image
import os
os.environ['HF_TOKEN'] = 'your token'
cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')
pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
width = 1024
height = 1024
output_size = (width, height)
pow_2 = PIL.Image.new('RGB', output_size)
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': pow_2,
    'prompt': 'test prompt'
}
pipe(**args)
- Edited these due to a late-night mix-up between WxH and HxW.
The normal pipeline presents a different error with VAE tiling enabled, after denoising completes.
It takes quite a while (for me) to run these at 30 inference steps, so I have reduced it to 3 here.
This is the source of my original confusion about a possible power-of-2 requirement, which I had not accounted for in my code (VAE tiling being on). The pipelines otherwise seem to work with an alignment of 16, just not with VAE tiling, which currently only works with a power of 2.
import diffusers
import os
os.environ['HF_TOKEN'] = 'your token'
pipe = diffusers.StableDiffusion3Pipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers')
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
width = 1376
height = 920
# aligned by 16, but alignment by 64 also fails
output_size = (width-(width % 16), height-(height % 16))
args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 3,
    'width': output_size[0],
    'height': output_size[1],
    'prompt': 'test prompt'
}
pipe(**args)
Log:
Will try to load from local cache.
Loading pipeline components...: 0%| | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:04<00:04, 4.34s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.56s/it]
Loading pipeline components...: 78%|███████▊ | 7/9 [00:11<00:01, 1.06it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████| 9/9 [00:45<00:00, 5.01s/it]
0%| | 0/3 [00:00<?, ?it/s]REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1135: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = hidden_states = F.scaled_dot_product_attention(
100%|██████████| 3/3 [01:16<00:00, 25.65s/it]
Traceback (most recent call last):
File "REDACT\test.py", line 27, in <module>
pipe(**args)
File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\pipelines\stable_diffusion_3\pipeline_stable_diffusion_3.py", line 895, in __call__
image = self.vae.decode(latents, return_dict=False)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 314, in decode
decoded = self._decode(z).sample
^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 280, in _decode
return self.tiled_decode(z, return_dict=return_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 412, in tiled_decode
tile = self.post_quant_conv(tile)
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
The mentioned issues with VAE tiling are due to vae/config.json having:
"use_post_quant_conv": false,
"use_quant_conv": false
which causes the modules used here:
https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L363
and here:
https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L412
to be None.
Perhaps, at the moment, the model is simply not entirely compatible with the tiling in AutoencoderKL, as the state dict does not contain the keys post_quant_conv.bias, quant_conv.weight, post_quant_conv.weight, quant_conv.bias.
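This can be confirmed by loading only the VAE and inspecting it (a quick sketch; it assumes the quant_conv/post_quant_conv attributes and config flags as they exist in diffusers 0.29.x):

# Sketch: load just the SD3 VAE and check the convs that tiled_encode/tiled_decode call.
import os
import diffusers

os.environ['HF_TOKEN'] = 'your token'
vae = diffusers.AutoencoderKL.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers', subfolder='vae')
print(vae.config.use_quant_conv, vae.config.use_post_quant_conv)  # False False
print(vae.quant_conv, vae.post_quant_conv)                        # None None

Forcing the convs back on when loading, as below, fails because the checkpoint has no weights for them: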
import os
import diffusers
os.environ['HF_TOKEN'] = 'your token'
vae = diffusers.AutoencoderKL.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    subfolder='vae',
    use_post_quant_conv=True,
    use_quant_conv=True)
Fail:
REDACT\venv\Lib\site-packages\diffusers\models\transformers\transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
Traceback (most recent call last):
File "REDACT\test.py", line 6, in <module>
vae = diffusers.AutoencoderKL.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "REDACT\venv\Lib\site-packages\diffusers\models\modeling_utils.py", line 750, in from_pretrained
raise ValueError(
ValueError: Cannot load <class 'diffusers.models.autoencoders.autoencoder_kl.AutoencoderKL'> from stabilityai/stable-diffusion-3-medium-diffusers because the following keys are missing:
post_quant_conv.weight, quant_conv.weight, post_quant_conv.bias, quant_conv.bias.
Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomly initialize those weights or else make sure your checkpoint file is correct.
How to solve the problem?
@sayakpaul, I am not sure if SD3 is intended to support VAE tiling?
Please open a new issue for that.
https://github.com/huggingface/diffusers/issues/8788
Closing this