
How to load LoRA weights with an fp8 transformer model?

Open · Johnson-yue opened this issue 8 months ago · 22 comments

Hi, I want to run FluxControlPipeline with `transformer_fp8`, referencing the code at https://huggingface.co/docs/diffusers/api/pipelines/flux#quantization:

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxControlPipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0]
image.save("flux.png")

But when I load a LoRA after building the pipeline:

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

there is an error saying fp8 weights are not supported. How can I fix it?

Johnson-yue avatar Jun 03 '25 10:06 Johnson-yue

Could you share the stack trace from the error and the output of `diffusers-cli env`, please? cc @sayakpaul in case this is a known problem with loading LoRA weights into an fp8 bnb transformer.

a-r-r-o-w avatar Jun 03 '25 10:06 a-r-r-o-w

@a-r-r-o-w `diffusers-cli env` info:

- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-136-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.11.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.29.3
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.15.2
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.4.5
- xFormers version: 0.0.29
- Accelerator: NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

When using `transformer_fp8`, the error when loading the LoRA is:

Traceback (most recent call last):
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/data/yue/Work_Project/Flux_Deploy/4_api_server.py", line 399, in <module>
    flux_inference = FLUX_INFERENCE(pretrained_model_name_or_path, rmT5=True, transformer_fp8=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/Work_Project/Flux_Deploy/4_api_server.py", line 220, in __init__
    self.control_pipe.load_lora_weights(depth_lora_path, adapter_name="depth")  # it must be load first
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/DeepLearning/git_repos/diffusers/src/diffusers/loaders/lora_pipeline.py", line 2000, in load_lora_weights
    has_param_with_expanded_shape = self._maybe_expand_transformer_param_shape_or_error_(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/DeepLearning/git_repos/diffusers/src/diffusers/loaders/lora_pipeline.py", line 2490, in _maybe_expand_transformer_param_shape_or_error_
    expanded_module = torch.nn.Linear(
                      ^^^^^^^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 105, in __init__
    self.weight = Parameter(
                  ^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/site-packages/torch/nn/parameter.py", line 46, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Only Tensors of floating point and complex dtype can require gradients

How can I fix this? Maybe with `pipe.eval()`, but I tested it and the pipeline has no `eval()` function.

Johnson-yue avatar Jun 04 '25 07:06 Johnson-yue

https://github.com/huggingface/diffusers/pull/11655

sayakpaul avatar Jun 04 '25 09:06 sayakpaul

@sayakpaul this pull request has not been merged into the main branch yet?

Johnson-yue avatar Jun 06 '25 03:06 Johnson-yue

@sayakpaul when I use the test code to load a LoRA with `transformer_fp8`, the speed is too slow, and some warnings are printed.

Test code:

import torch
from diffusers import FluxControlPipeline
from image_gen_aux import DepthPreprocessor
from diffusers.utils import load_image
from diffusers.quantizers import PipelineQuantizationConfig

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit", 
        quant_kwargs={"load_in_8bit": True}, 
        components_to_quantize=["transformer", "text_encoder_2"]
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")

And the warnings:

.....
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.36.attn.norm_k.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.37.attn.norm_q.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.37.attn.norm_k.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(

43%|████████████████████████▎                                       | 13/30 [06:48<08:45, 30.91s/it]

The machine is an A800 80G.

When not loading the LoRA with `transformer_fp8`, using the code below:

import torch
from diffusers import FluxControlPipeline
from image_gen_aux import DepthPreprocessor
from diffusers.utils import load_image
from diffusers.quantizers import PipelineQuantizationConfig

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit", 
        quant_kwargs={"load_in_8bit": True}, 
        components_to_quantize=["transformer", "text_encoder_2"]
    ),
    torch_dtype=torch.float16,
).to("cuda")
# pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")

This pipeline runs faster than the one with the LoRA loaded:

The module 'T5EncoderModel' has been loaded in `bitsandbytes` 8bit and moving it to cuda via `.to()` is not supported. Module is still on cuda:0.
The module 'FluxTransformer2DModel' has been loaded in `bitsandbytes` 8bit and moving it to cuda via `.to()` is not supported. Module is still on cuda:0.
100%|████████████████████████████████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]

Johnson-yue avatar Jun 06 '25 04:06 Johnson-yue

Could it be because when we load the LoRA into an 8-bit base model, there's a single layer that is not quantized? It was a conscious decision we took when implementing this feature for the 4-bit use case as well. Can you try with 4-bit and see what happens?

Also, for what it's worth, can we update the description of this issue from fp8 to int8? When we leverage 8-bit in bitsandbytes, it follows the LLM.int8() scheme. It is NOT FP8.

Tagging @matthewdouglas from the bitsandbytes team for the warning being faced.

sayakpaul avatar Jun 06 '25 04:06 sayakpaul

Btw, the same slow speed problem happens in the regular LoRA case too:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit", 
        quant_kwargs={"load_in_8bit": True}, 
        components_to_quantize=["transformer", "text_encoder_2"]
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("Purz/choose-your-own-adventure")

prompt = "cy04 A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."

image = pipeline(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=3.5,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")

So, I am not sure if it's a specific problem or a general problem with 8bit models + LoRA. I am on PyTorch 2.7 and bitsandbytes 0.46.0.

sayakpaul avatar Jun 06 '25 04:06 sayakpaul

I think I know the cause of the slow speed. Long story short, it stems from here:

https://github.com/huggingface/diffusers/blob/0f91f2f6fc697f01ca6da6724e2b3b5600b56a9b/src/diffusers/loaders/peft.py#L416

@SunMarc the above code renders the is_sequential_cpu_offload to be True. Here's a minimal snippet to reproduce:

from diffusers import AutoModel, BitsAndBytesConfig
from accelerate.hooks import AlignDevicesHook
import torch 

model = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", 
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

is_sequential_cpu_offload = (
    isinstance(model._hf_hook, AlignDevicesHook)
    or hasattr(model._hf_hook, "hooks")
    and isinstance(model._hf_hook.hooks[0], AlignDevicesHook)
)
print(f"{is_sequential_cpu_offload=}")

Is this expected?
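Side note: the check above leans on Python's operator precedence, where `and` binds tighter than `or`, so `A or B and C` parses as `A or (B and C)`. A minimal self-contained illustration (the function name is hypothetical, mirroring the shape of the check):

```python
# `and` binds tighter than `or`, so `a or b and c` == `a or (b and c)`.
# This mirrors the structure of the is_sequential_cpu_offload check.
def check(a, b, c):
    return a or b and c

assert check(True, False, False) is True    # `a` alone short-circuits
assert check(False, True, False) is False   # falls through to (b and c)
assert check(False, True, True) is True
```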

sayakpaul avatar Jun 06 '25 04:06 sayakpaul

@sayakpaul yes, you are right, the code makes is_sequential_cpu_offload True. The code I tested is:

import torch
from diffusers import FluxControlPipeline
from image_gen_aux import DepthPreprocessor
from diffusers.utils import load_image
from diffusers.quantizers import PipelineQuantizationConfig
from accelerate.hooks import AlignDevicesHook

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit", 
        quant_kwargs={"load_in_8bit": True}, 
        components_to_quantize=["transformer", "text_encoder_2"]
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

# newly added for this test
transformer_is_sequential_cpu_offload = (
    isinstance(pipeline.transformer._hf_hook, AlignDevicesHook)
    or hasattr(pipeline.transformer._hf_hook, "hooks")
    and isinstance(pipeline.transformer._hf_hook.hooks[0], AlignDevicesHook)
)
t5_is_sequential_cpu_offload = (
    isinstance(pipeline.text_encoder_2._hf_hook, AlignDevicesHook)
    or hasattr(pipeline.text_encoder_2._hf_hook, "hooks")
    and isinstance(pipeline.text_encoder_2._hf_hook.hooks[0], AlignDevicesHook)
)
print(f"transformer: {transformer_is_sequential_cpu_offload=} | t5: {t5_is_sequential_cpu_offload}")



prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")

And the outputs are both True: `transformer: transformer_is_sequential_cpu_offload=True | t5: True`

My question is how to use a quantized model with a loaded LoRA, not only fp8; if int4 is supported, that is also good. But I have not found any tutorial for it.

About your suggestion, I have two points:

  1. If FluxControlPipeline with int4 and a loaded LoRA is supported, would you give me some example code?
  2. Can the fp8 problem be fixed? How? Only by manually setting is_sequential_cpu_offload=False for the T5 text encoder and FluxTransformer? And how do I disable sequential CPU offload?

thanks

Johnson-yue avatar Jun 06 '25 06:06 Johnson-yue

Hi all,

I can't speak to some of the warnings just yet, but can offer some thoughts:

  • I don't know what transformers_fp8 is.
  • I agree that we should clarify that bitsandbytes 8bit quantization is an int8 and not an fp8 format. @Johnson-yue I don't think you want to use an fp8 method anyway, as your A800 GPU likely does not offer native support for it.
  • When using bitsandbytes int8, you can try setting llm_int8_threshold to 0.0 to make it faster. In that setting it is all int8, vs the default LLM.int8() algorithm which holds back part of the matmul to be done in fp16. There may be an accuracy drop, but it could be a worthwhile tradeoff if speed is more important.

For 4bit, you can swap out bitsandbytes_8bit and load_in_8bit=True for bitsandbytes_4bit and load_in_4bit=True. In this case I would recommend using bfloat16 instead of fp16.

PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

matthewdouglas avatar Jun 06 '25 16:06 matthewdouglas

@matthewdouglas `transformer_fp8` is the code below:

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

Maybe I did not say it right:

  1. Even though I said fp8, my code is:
pipeline = FluxControlPipeline.from_pretrained(
    depth_pretrained_model_name_or_path,  # pretrained_model_name_or_path
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"]
    ),
    torch_dtype=torch.float16,
).to("cuda")

It uses `bitsandbytes_8bit`, right? And the speed is slow, as @sayakpaul said: when using `bitsandbytes_8bit`, `is_sequential_cpu_offload` is set to True.

I think I know the cause of the slow speed. Long story short, it stems from here:

https://github.com/huggingface/diffusers/blob/0f91f2f6fc697f01ca6da6724e2b3b5600b56a9b/src/diffusers/loaders/peft.py#L416 (`elif is_sequential_cpu_offload:`)

@SunMarc the above code renders the is_sequential_cpu_offload to be True. Here's a minimal snippet to reproduce:

from diffusers import AutoModel, BitsAndBytesConfig
from accelerate.hooks import AlignDevicesHook
import torch

model = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

is_sequential_cpu_offload = (
    isinstance(model._hf_hook, AlignDevicesHook)
    or hasattr(model._hf_hook, "hooks")
    and isinstance(model._hf_hook.hooks[0], AlignDevicesHook)
)
print(f"{is_sequential_cpu_offload=}")

So the question here is: how to load a LoRA with bitsandbytes_8bit or bitsandbytes_4bit and use the GPU fully, without any offload?

Johnson-yue avatar Jun 09 '25 02:06 Johnson-yue

Use 4-bit for now, as it doesn't have that problem. 8-bit models seem to have that hook, so I suspect that is what breaks when a LoRA is loaded. That is why I asked @SunMarc a question in https://github.com/huggingface/diffusers/issues/11648#issuecomment-2948073950.

sayakpaul avatar Jun 09 '25 03:06 sayakpaul

@sayakpaul ok, I will test it later . thanks everyone

Johnson-yue avatar Jun 09 '25 05:06 Johnson-yue

Keeping it opened for https://github.com/huggingface/diffusers/issues/11648#issuecomment-2948073950.

sayakpaul avatar Jun 11 '25 03:06 sayakpaul

I tested two configurations:

1) only pipeline

pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", **pipe_cfgs)

2) pipeline + 1 LoRA

pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", **pipe_cfgs)
pipe.load_lora_weights("black-forest-labs/FLUX.1-Canny-dev-lora", adapter_name="Canny")
pipe.set_adapters("Canny", 0.85)

And I tested the weight types torch.float16, bnb_8bit, and bnb_4bit:

# for torch.float16
pipe_cfgs = {}

# for bnb_8bit config
pipe_cfgs = {
    "quantization_config": PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={
            "load_in_8bit": True,
        },
        components_to_quantize=["transformer"]
    )
}

# for bnb_4bit config
pipe_cfgs = {
    "quantization_config": PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16
        },
        components_to_quantize=["transformer"]
    )
}

The result is below.

Notes:

  1. Fused-LoRA pipeline (only 1 pipeline, no separate LoRA): the trained LoRA is fused via fuse_lora into FluxControlPipeline. Done.
  2. Make union_prompt in lora_prompts for training a better union_lora.
  3. For the only-pipeline case, bnb_4bit originally used 9.51 G and took 1min14s; after setting "bnb_4bit_compute_dtype": torch.bfloat16 it used 8.95 G and took 13s.

Configuration       | Weight Type | GPU Memory | Time
--------------------|-------------|------------|-----------
Only Pipeline       | Float16     | 26.61 G    | 10 s
                    | BnB 8-bit   | 14.92 G    | 13 s
                    | BnB 4-bit   | 8.95 G     | 13 s
Pipeline + 1 LoRA   | Float16     | 27.34 G    | 11 s
                    | BnB 8-bit   | 14.70 G    | 3 min 40 s
                    | BnB 4-bit   | 9.68 G     | 15 s

I want to know how to make BnB 8-bit use only the GPU (without CPU offload), and whether bnb_8bit is faster than bnb_4bit?

Johnson-yue avatar Jun 11 '25 07:06 Johnson-yue

Traceback (most recent call last): xxxxxxxxx ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py). I got this error; my versions are torch 2.5.1 and diffusers 0.33.1.

babyta avatar Jun 18 '25 07:06 babyta

Traceback (most recent call last): xxxxxxxxx ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py). I got this error; my versions are torch 2.5.1 and diffusers 0.33.1.

Sorry, my fault. `pip install git+https://github.com/huggingface/diffusers` solved it.

babyta avatar Jun 18 '25 08:06 babyta

The module 'T5EncoderModel' has been loaded in bitsandbytes 8bit and moving it to cpu via .to() is not supported. Module is still on cuda:0.

Teachers, how can I save GPU memory? I found that the memory usage is too high. Setting enable_model_cpu_offload(gpu_id=pipe_gpu_id) is useless.

babyta avatar Jun 18 '25 09:06 babyta

@babyta this is an issue about a problem with 8-bit quantization with bitsandbytes, and you're asking a general question about usage. Please let's keep issues for reporting problems with the library so we can solve them.

If you want help saving VRAM, we have docs for that; if you need more help, please open a discussion (not an issue) about it, and I'll try to help you, or maybe someone from the community will.

With 8-bit bnb quantization, you can't move the quantized model back to CPU (for now), so if you use quantization, you must ensure all the models fit in VRAM for inference. If 8-bit doesn't fit on your GPU, you'll have to use 4-bit, and if that doesn't work, you'll need to do something other than just quantization. What you can do depends on how much RAM you have.

asomoza avatar Jun 18 '25 19:06 asomoza

Related issue for 8bit device movement in bitsandbytes can be tracked here: https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1332

matthewdouglas avatar Jun 18 '25 19:06 matthewdouglas

Traceback (most recent call last): xxxxxxxxx ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py). Got this error; my versions are torch 2.5.1 and diffusers 0.33.1.

same error

lonngxiang avatar Jun 19 '25 08:06 lonngxiang

Traceback (most recent call last): xxxxxxxxx ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py). Got this error; my versions are torch 2.5.1 and diffusers 0.33.1.

same error

@lonngxiang clone diffusers from GitHub and run `pip install -e .`, or run `pip install git+https://github.com/huggingface/diffusers`.

Johnson-yue avatar Jun 19 '25 12:06 Johnson-yue