diffusers add OnnxStableDiffusionUpscalePipeline pipeline

I think I have a working implementation of an OnnxStableDiffusionUpscalePipeline, which extends StableDiffusionUpscalePipeline to be compatible with OnnxRuntimeModel. I'm hoping to get some feedback on whether this is the right approach, and if so, what else I need to do before this can be merged besides writing tests. There are a few spots in the code that I have questions about, marked with # TODO comments and noted at the bottom here.

Motivation

Running the current StableDiffusionUpscalePipeline on a machine without CUDA acceleration can be pretty slow, even with relatively small 128x128 input images. I am writing a web UI for running ONNX pipelines that allows you to run a series of upscaling models (or one model repeatedly), but running StableDiffusionUpscalePipeline on a 1024px square input (split into 128px tiles) can easily take 60+ minutes on a 16 core CPU. Using the ONNX runtime is much faster, but that combination was not available, so I wrote this pipeline.

Per 128x128 tile:
- Using StableDiffusionUpscalePipeline: 2.98s/it or 02:28 per tile
- Using OnnxStableDiffusionUpscalePipeline w/ ROCmExecutionProvider: 6.46it/s or 00:07 per tile
- Using OnnxStableDiffusionUpscalePipeline w/ DMLExecutionProvider: 1.17it/s or 00:42 per tile

Upscaling 512x512 -> 2048x2048, 16 runs with 50 inference steps each:
- Using StableDiffusionUpscalePipeline: finished pipeline in 0:41:00.270845
- Using OnnxStableDiffusionUpscalePipeline w/ ROCmExecutionProvider: finished pipeline in 0:02:10.359478

Upscaling 1024x1024 -> 4096x4096, 64 runs with 50 inference steps each:
- Using StableDiffusionUpscalePipeline: still running
- Using OnnxStableDiffusionUpscalePipeline w/ ROCmExecutionProvider: finished pipeline in 0:05:53.323918
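As a rough sanity check on these numbers, the tile counts and the speedup can be recomputed from the reported figures. This is only a sketch: it assumes square inputs split into non-overlapping 128px tiles and 50 steps per tile, and the helper names are made up for illustration.

```python
from datetime import timedelta

TILE_SIZE = 128  # tile edge in pixels, as in the benchmarks
STEPS = 50       # inference steps per tile


def tile_count(edge: int, tile: int = TILE_SIZE) -> int:
    """Number of non-overlapping square tiles needed to cover a square image."""
    per_side = -(-edge // tile)  # ceiling division
    return per_side * per_side


def seconds_per_tile(rate: float, unit: str, steps: int = STEPS) -> float:
    """Convert a tqdm-style rate ("s/it" or "it/s") into seconds per tile."""
    return steps * rate if unit == "s/it" else steps / rate


# Reported wall-clock times for the 512x512 -> 2048x2048 run.
torch_time = timedelta(minutes=41, seconds=0.270845)
onnx_time = timedelta(minutes=2, seconds=10.359478)
speedup = torch_time / onnx_time  # dividing two timedeltas gives a float

print("tiles:", tile_count(512), tile_count(1024))
print("s/tile (torch):", seconds_per_tile(2.98, "s/it"))
print("s/tile (ROCm):", seconds_per_tile(6.46, "it/s"))
print(f"speedup of the ROCm run: {speedup:.1f}x")
```

The tile counts line up with the 16 and 64 runs listed for the two larger benchmarks.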

I have only tested this using the CPUExecutionProvider and ROCmExecutionProvider so far, but I have machines set up for testing the CUDAExecutionProvider and DMLExecutionProvider and will check those as well.

I tried to make the fewest changes necessary and ended up only overriding a few methods. It looks like the preference in some of the other pipelines is to copy methods, which I can also do, but I wanted to find the minimum viable diff. Most of the changes are around passing named parameters to the models and replacing .sample with [0], but there are a few ndarray.int() calls that I'm not sure about, and the StableDiffusionUpscalePipeline code uses some config values that do not appear to exist on OnnxRuntimeModel.
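To illustrate the kind of change involved, here is a minimal sketch of the two calling conventions. Both classes are hypothetical stand-ins, not the real diffusers models: the torch-style model takes positional arguments and returns an object with a .sample attribute, while the ONNX-style wrapper takes named inputs and returns a list of outputs.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Stand-in for a torch-style model output carrying a .sample field."""
    sample: list


class TorchStyleUNet:
    """Hypothetical torch-style model: positional inputs, object output."""

    def __call__(self, sample, timestep, encoder_hidden_states):
        return Output(sample=[x * 0.5 for x in sample])


class OnnxStyleUNet:
    """Hypothetical ONNX-style wrapper: named inputs, list of outputs."""

    def __call__(self, **kwargs):
        return [[x * 0.5 for x in kwargs["sample"]]]


latents = [1.0, 2.0, 4.0]
text_embeddings = [0.0]

# Torch-style call: read the prediction from .sample
torch_pred = TorchStyleUNet()(latents, 10, text_embeddings).sample

# ONNX-style call: pass every input by name, take the first output
onnx_pred = OnnxStyleUNet()(
    sample=latents, timestep=10, encoder_hidden_states=text_embeddings
)[0]

print(torch_pred == onnx_pred)
```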

Example

import torch
from diffusers import OnnxStableDiffusionUpscalePipeline, StableDiffusionOnnxPipeline

prompt = "an astronaut eating a hamburger"
steps = 50

txt2img = StableDiffusionOnnxPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="CUDAExecutionProvider",
)
small_image = txt2img(
    prompt,
    num_inference_steps=steps,
).images[0]

generator = torch.manual_seed(0)
upscale = OnnxStableDiffusionUpscalePipeline.from_pretrained(
    "ssube/stable-diffusion-x4-upscaler-onnx",
    provider="CUDAExecutionProvider",
)
large_image = upscale(
    prompt,
    small_image,
    generator=generator,
    num_inference_steps=steps,
).images[0]

TODOs

- [x] lint (style & quality)
- [x] unit tests
- [x] docs?
test with provider:
- [x] CUDA
- [x] CPU
- [x] DirectML
- [x] ROCm
in code:
- [x] look up vae.config.latent_channels: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR18
- [x] look up unet.config.in_channels: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR21
- [x] convert text_embeddings.dtype to torch: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR97
- [x] make sure text_input_ids.int() is safe: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR226
  - should be equivalent to .astype(np.int32) in https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion.py#L150
- [x] check if attention_mask is needed: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR228
  - I don't see this being used in the other ONNX pipelines, like img2img
- [x] check if text_embeddings.view is needed: https://github.com/huggingface/diffusers/pull/2158/files#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR235
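On the text_input_ids.int() item, here is a minimal NumPy sketch of why the narrowing cast should be safe for token IDs. The vocabulary size is an assumed value for illustration, not read from the tokenizer, and to_int32_checked is a hypothetical helper. The point is only that IDs far below 2**31 survive the cast to int32 unchanged.

```python
import numpy as np

ASSUMED_VOCAB_SIZE = 49408  # assumption for illustration, not read from the tokenizer


def to_int32_checked(ids: np.ndarray) -> np.ndarray:
    """Cast integer IDs to int32, refusing values that would not fit."""
    limits = np.iinfo(np.int32)
    if ids.size and (ids.min() < limits.min or ids.max() > limits.max):
        raise OverflowError("token IDs do not fit in int32")
    return ids.astype(np.int32)


# Tokenizers typically return int64 IDs; these values are made up.
token_ids = np.array([[0, 320, 18376, ASSUMED_VOCAB_SIZE - 1]], dtype=np.int64)
narrowed = to_int32_checked(token_ids)

print(narrowed.dtype, np.array_equal(narrowed, token_ids))
```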

Jan 30 '23 04:01 ssube

I added a basic test, which is passing locally (13 passed, 10 skipped in 67.31s (0:01:07)), but relies on an ONNX revision of stabilityai/stable-diffusion-x4-upscaler that does not exist in the https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/tree/main repo.

Jan 31 '23 01:01 ssube

@ssube How did you convert this model to ONNX format? I'm getting a bunch of errors (AMD GPU & Windows). I want to try it in DML mode.

Jan 31 '23 02:01 ForserX

@ForserX I'm using this script: https://github.com/ssube/onnx-web/blob/main/api/onnx_web/convert.py#L206 It's very close to https://github.com/huggingface/diffusers/blob/main/scripts/convert_stable_diffusion_checkpoint_to_onnx.py, but the single_vae branches are new for upscaling. https://github.com/ssube/onnx-web/commit/bacce0ace9e67270dfe65b31a8ae18196c142009#diff-2b8422f2625f7e1cd0ca3fa3e9975deed7d4962823108c2fc29f14c53e2c0cc6 is the bulk of the changes. I got it to work by switching between class_labels and return_dict on the UNet inputs and exporting a single VAE rather than splitting the encoder/decoder. No idea if that's right. :smile:

Jan 31 '23 02:01 ssube

How difficult everything is... I'll try, if it doesn't work out, I'll ask for a ready-made model))

Jan 31 '23 02:01 ForserX

Using that convert.py script, I was able to convert the model on Windows 10 and run it using the DirectMLExecutionProvider on an AMD GPU. The output looks about right, nothing unusual showing up. I've added the iteration and 128px tile times to the description. It's not as fast as ROCm, from initial testing, but still much faster than CPU (roughly 5x).

Some logs from that:

Fetching 17 files: 100%|██████████| 17/17 [02:49<00:00,  9.95s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
C:\Users\ssube\stabdiff\onnx-try-2\onnx-web\api\onnx_env\lib\site-packages\transformers\models\clip\modeling_clip.py:754: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask.fill_(torch.tensor(torch.finfo(dtype).min))
C:\Users\ssube\stabdiff\onnx-try-2\onnx-web\api\onnx_env\lib\site-packages\torch\onnx\symbolic_opset9.py:5408: UserWarning: Exporting aten::index operator of advanced indexing in opset 14 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
  warnings.warn(
[2023-01-30 21:03:37,446] INFO: __main__: UNET config: FrozenDict([('sample_size', 128), ('in_channels', 7), ('out_channels', 4), ('center_input_sample', False), ('flip_sin_to_cos', True), ('freq_shift', 0), ('down_block_types', ['DownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D']), ('mid_block_type', 'UNetMidBlock2DCrossAttn'), ('up_block_types', ['CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'UpBlock2D']), ('only_cross_attention', [True, True, True, False]), ('block_out_channels', [256, 512, 512, 1024]), ('layers_per_block', 2), ('downsample_padding', 1), ('mid_block_scale_factor', 1), ('act_fn', 'silu'), ('norm_num_groups', 32), ('norm_eps', 1e-05), ('cross_attention_dim', 1024), ('attention_head_dim', 8), ('dual_cross_attention', False), ('use_linear_projection', True), ('class_embed_type', None), ('num_class_embeds', 1000), ('upcast_attention', False), ('resnet_time_scale_shift', 'default'), ('_class_name', 'UNet2DConditionModel'), ('_diffusers_version', '0.9.0.dev0'), ('_name_or_path', 'C:\\Users\\ssube/.cache\\huggingface\\diffusers\\models--stabilityai--stable-diffusion-x4-upscaler\\snapshots\\19b610c68ca7572defb6e09e64d1063f32b4db83\\unet')])
[2023-01-30 21:04:33,172] INFO: __main__: VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types', ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']), ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512]), ('layers_per_block', 2), ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 256), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.9.0.dev0'), ('_name_or_path', 'C:\\Users\\ssube/.cache\\huggingface\\diffusers\\models--stabilityai--stable-diffusion-x4-upscaler\\snapshots\\19b610c68ca7572defb6e09e64d1063f32b4db83\\vae')])
[2023-01-30 21:04:43,174] INFO: __main__: exporting ONNX model
[2023-01-30 21:04:43,225] INFO: __main__: ONNX pipeline saved to ..\models\upscaling-stable-diffusion-x4
[2023-01-30 21:04:47,267] INFO: __main__: ONNX pipeline is loadable

and

[2023-01-30 21:29:02,983] INFO: onnx_web.chain.upscale_outpaint: final output image size: 1024x1024
[2023-01-30 21:29:02,984] INFO: onnx_web.chain.base: finished stage expand, result size: 1024x1024
[2023-01-30 21:29:02,984] INFO: onnx_web.chain.base: running stage upscale on image with dimensions 1024x1024, dict_keys(['output', 'size', 'prompt', 'scale', 'outscale', 'tile_size', 'upscale'])
[2023-01-30 21:29:02,984] INFO: onnx_web.chain.base: image larger than tile size of SizeChart.mini, tiling stage
[2023-01-30 21:29:02,992] INFO: onnx_web.chain.utils: processing tile 1 of 64, 0.0
[2023-01-30 21:29:02,993] INFO: onnx_web.chain.upscale_stable_diffusion: upscaling with Stable Diffusion, 50 steps
2023-01-30 21:29:03.0777243 [W:onnxruntime:, inference_session.cc:493 onnxruntime::InferenceSession::RegisterExecutionProvider] Having memory pattern enabled is not supported while using the DML Execution Provider. So disabling it for this session since it uses the DML Execution Provider.
2023-01-30 21:29:04.0214862 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-01-30 21:29:04.0253514 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2023-01-30 21:29:05.8783892 [W:onnxruntime:, inference_session.cc:493 onnxruntime::InferenceSession::RegisterExecutionProvider] Having memory pattern enabled is not supported while using the DML Execution Provider. So disabling it for this session since it uses the DML Execution Provider.
2023-01-30 21:29:05.9192614 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-01-30 21:29:05.9231882 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2023-01-30 21:29:06.5044530 [W:onnxruntime:, inference_session.cc:493 onnxruntime::InferenceSession::RegisterExecutionProvider] Having memory pattern enabled is not supported while using the DML Execution Provider. So disabling it for this session since it uses the DML Execution Provider.
2023-01-30 21:29:06.7290335 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-01-30 21:29:06.7331632 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
  8%|▊         | 4/50 [00:09<01:53,  2.47s/it]

Jan 31 '23 03:01 ssube

(roughly 5x)

The Vulkan variation of ESRGAN works even faster

Please check your email.

Jan 31 '23 03:01 ForserX

I pushed a copy of the model that I have been using to https://huggingface.co/ssube/stable-diffusion-x4-upscaler-onnx and updated the tests accordingly :crossed_fingers:

Jan 31 '23 04:01 ssube

Cool, cc @anton-l @echarlaix for review

Jan 31 '23 13:01 patrickvonplaten

Now we just have to wait for custom VAE and LoRA support for ONNX))

Jan 31 '23 18:01 ForserX

I added another, longer test and fixed up a few of the TODOs. The remaining ones are all related to hard-coded channel counts and the text_embeddings dtype. I'm not sure where to look those up, since they don't seem to be present on the OnnxRuntimeModel.

I also tried adding attention_mask back to the text encoder, but I don't see it being used in the other ONNX pipelines, and attempting to add it causes a 2 : INVALID_ARGUMENT : Invalid Feed Input Name:attention_mask error.
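One way to avoid this kind of error is to feed only the inputs that the exported model actually declares. The sketch below uses a hypothetical helper, filter_feed, and a hard-coded list of input names. With onnxruntime, the names could be read from InferenceSession.get_inputs(), assuming direct access to the session. The token IDs are made up.

```python
from typing import Any, Dict, Iterable


def filter_feed(feed: Dict[str, Any], input_names: Iterable[str]) -> Dict[str, Any]:
    """Drop any feed entries that the model does not declare as inputs."""
    allowed = set(input_names)
    return {name: value for name, value in feed.items() if name in allowed}


# Assumed input list for a text encoder exported with only input_ids.
declared_inputs = ["input_ids"]

feed = {
    "input_ids": [[49406, 320, 49407]],
    "attention_mask": [[1, 1, 1]],
}
safe_feed = filter_feed(feed, declared_inputs)

print(sorted(safe_feed))
```

Silently dropping an input is only reasonable when the model was exported without it, which appears to be the case for the other ONNX pipelines mentioned here.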

Feb 01 '23 04:02 ssube

cc @anton-l

Feb 09 '23 10:02 patrickvonplaten

Is there anything else I can/should add to this? I'm not sure where to look up the vae.config/unet.config equivalents, or how important that is.

Feb 09 '23 14:02 ssube

@anton-l can you take a look here?

Feb 13 '23 11:02 patrickvonplaten

I've been using and testing this pipeline more, with more schedulers, and fixed a couple of issues related to the mix of numpy and torch types. There was an unsupported operand type(s) for *: 'numpy.ndarray' and 'Tensor' error with some (but not all) schedulers, which I fixed based on https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion_inpaint.py#L437. I've added tests for all of the same schedulers that are tested in https://github.com/huggingface/diffusers/blob/main/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_img2img.py except for a fast test for LMS discrete, which was timing out.

There were a few .config lookups that I wasn't sure about, but it looks like the other ONNX pipelines declare them as constants, so I did the same: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion_inpaint.py#L34
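A small sketch of what such constants could look like, using the channel counts from the UNet and VAE configs logged earlier in this thread (latent_channels 4, UNet in_channels 7). The constant names and the check_channels helper are illustrative, not necessarily what the PR uses. The check assumes the UNet input is the latents concatenated with a 3-channel low-resolution image along the channel axis.

```python
import numpy as np

# Values taken from the logged configs; names are illustrative.
NUM_LATENT_CHANNELS = 4
NUM_UNET_INPUT_CHANNELS = 7


def check_channels(latents: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Concatenate latents and image on axis 1 after validating the channel total."""
    total = latents.shape[1] + image.shape[1]
    if total != NUM_UNET_INPUT_CHANNELS:
        raise ValueError(
            f"expected {NUM_UNET_INPUT_CHANNELS} input channels, got {total}"
        )
    return np.concatenate([latents, image], axis=1)


latents = np.zeros((1, NUM_LATENT_CHANNELS, 128, 128), dtype=np.float32)
image = np.zeros((1, 3, 128, 128), dtype=np.float32)
unet_input = check_channels(latents, image)

print(unet_input.shape)
```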

The last issue I'm aware of is a slight difference between the parameter types to the scheduler.step() call: many of the other ONNX pipelines use something like torch.from_numpy(noise_pred), t, torch.from_numpy(latents), **extra_step_kwargs, but I think due to the restrictions in ORT, converting latents to a tensor causes a TypeError: expected np.ndarray (got Tensor) error and does not seem right here. torch.from_numpy(noise_pred), t, latents, **extra_step_kwargs does appear to work.

I did run into one issue with int32 vs int64 types, but that appears to be related to how the model is trained or serialized, and exporting it again with the 4th input as a torch.long solved that:

     # UNET
     if single_vae:
         unet_inputs = ["sample", "timestep", "encoder_hidden_states", "class_labels"]
-        unet_scale = torch.tensor(4).to(device=ctx.training_device, dtype=torch.int)
+        unet_scale = torch.tensor(4).to(
+            device=ctx.training_device, dtype=torch.long
+        )

Feb 15 '23 23:02 ssube

I inlined the integer type and put in lookups for the other two. One of them needed to go from the numpy dtype to the torch dtype, since that's what the StableDiffusionUpscalePipeline expects, so I put in a little lookup table for that. Hopefully that is ok: https://github.com/huggingface/diffusers/pull/2158/commits/75cadf230e518d0c942eee0cd3187cee4ae59b28#diff-3815a0888bb607ca69fe4022fa3b4a809687fe2b3ae4d0ea0397288fac3c920bR20-R23

For the int32/64 issue that I mentioned, I tested that a little bit more, and everything seems to work as long as the type in the convert/export code and the pipeline match. Is there any reason not to use int64 there? For more context, this is my convert script and the relevant part is:

    # UNET
    if single_vae: # upscale pipeline
        unet_inputs = ["sample", "timestep", "encoder_hidden_states", "class_labels"]
        unet_scale = torch.tensor(4).to(device=ctx.training_device, dtype=torch.long) # <- this is the type that needs to match
    else:
        unet_inputs = ["sample", "timestep", "encoder_hidden_states", "return_dict"]
        unet_scale = torch.tensor(False).to(
            device=ctx.training_device, dtype=torch.bool
        )

    unet_in_channels = pipeline.unet.config.in_channels
    unet_sample_size = pipeline.unet.config.sample_size
    unet_path = output_path / "unet" / "model.onnx"
    onnx_export(
        pipeline.unet,
        model_args=(
            torch.randn(2, unet_in_channels, unet_sample_size, unet_sample_size).to(
                device=ctx.training_device, dtype=dtype
            ),
            torch.randn(2).to(device=ctx.training_device, dtype=dtype),
            torch.randn(2, num_tokens, text_hidden_size).to(
                device=ctx.training_device, dtype=dtype
            ),
            unet_scale,
        ),
        output_path=unet_path,
        ordered_input_names=unet_inputs,
        # has to be different from "sample" for correct tracing
        output_names=["out_sample"],
        dynamic_axes={
            "sample": {0: "batch", 1: "channels", 2: "height", 3: "width"},
            "timestep": {0: "batch"},
            "encoder_hidden_states": {0: "batch", 1: "sequence"},
        },
        opset=ctx.opset,
        use_external_data_format=True,  # UNet is > 2GB, so the weights need to be split
    )
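Since the class_labels dtype has to match between export and inference, one option is to derive it from the type the exported model declares. This is a sketch under assumptions: the type strings follow the "tensor(int64)" form that onnxruntime reports for model inputs, while the mapping table, the make_class_labels helper, and the noise level of 20 are hypothetical and only cover the two integer types discussed here.

```python
import numpy as np

# Hypothetical table covering only the two integer types discussed here.
ORT_INT_TYPES = {
    "tensor(int32)": np.int32,
    "tensor(int64)": np.int64,
}


def make_class_labels(noise_level: int, batch_size: int, declared_type: str) -> np.ndarray:
    """Build a class_labels array whose dtype matches the model's declared input type."""
    try:
        dtype = ORT_INT_TYPES[declared_type]
    except KeyError:
        raise TypeError(f"unsupported class_labels type: {declared_type}") from None
    return np.full((batch_size,), noise_level, dtype=dtype)


# A model exported with torch.long would declare tensor(int64).
labels = make_class_labels(noise_level=20, batch_size=2, declared_type="tensor(int64)")

print(labels.dtype, labels.tolist())
```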

Feb 17 '23 04:02 ssube

Looks good to me - thanks for checking the PR @anton-l :-)

cc @williamberman could you also take a quick look?

Feb 27 '23 17:02 patrickvonplaten

Merging to not block the community contributor here

Mar 06 '23 10:03 patrickvonplaten

Hello. With diffusers versions > 0.16.0, this pipeline throws an exception, apparently because the vae.config attribute check was removed.

  File "/opt/conda/envs/lora/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py", line 1101, in from_pretrained
    model = pipeline_class(**init_kwargs)
  File "/opt/conda/envs/lora/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion_upscale.py", line 59, in __init__
    super().__init__(
  File "/opt/conda/envs/lora/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py", line 134, in __init__
    self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
AttributeError: 'OnnxRuntimeModel' object has no attribute 'config'
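The failing line computes the VAE scale factor from config.block_out_channels, which OnnxRuntimeModel does not expose. One possible workaround, sketched here with a hypothetical helper and stand-in objects (not the fix adopted in diffusers), is to fall back to a constant when no config is present. The fallback of 4 assumes the three-entry block_out_channels from the VAE config logged earlier in this thread.

```python
from types import SimpleNamespace

# 2 ** (3 - 1), assuming block_out_channels == [128, 256, 512] as in the logged VAE config.
DEFAULT_VAE_SCALE_FACTOR = 4


def get_vae_scale_factor(vae, default: int = DEFAULT_VAE_SCALE_FACTOR) -> int:
    """Compute the scale factor from the VAE config, or use a default without one."""
    config = getattr(vae, "config", None)
    channels = getattr(config, "block_out_channels", None)
    if channels is None:
        return default
    return 2 ** (len(channels) - 1)


# Stand-ins: a torch-style VAE with a config, and an ONNX-style model without one.
torch_vae = SimpleNamespace(config=SimpleNamespace(block_out_channels=[128, 256, 512]))
onnx_vae = SimpleNamespace()

print(get_vae_scale_factor(torch_vae), get_vae_scale_factor(onnx_vae))
```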

Jul 25 '23 21:07 zetyquickly

Thanks for the ping @zetyquickly ! Would you like to open an issue to fix it?

Aug 02 '23 18:08 patrickvonplaten