How to load LoRA weights with an fp8 transformer model?
Hi, I want to run FluxControlPipeline with transformer_fp8, referencing the code from https://huggingface.co/docs/diffusers/api/pipelines/flux#quantization:
```python
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxControlPipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0]
image.save("flux.png")
```
but when I load a LoRA after building the pipeline:
```python
pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)
pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")
```
I get an error: fp8 weights are not supported. How can I fix it?
Could you share the stack trace from the error and the output of `diffusers-cli env`, please? cc @sayakpaul in case this is a known problem with loading LoRA weights into an fp8 bnb transformer.
@a-r-r-o-w `diffusers-cli env` info:
- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-136-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.11.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.29.3
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.15.2
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.4.5
- xFormers version: 0.0.29
- Accelerator: NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
NVIDIA A800 80GB PCIe, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
When loading a LoRA with transformer_fp8, the error is:
```
Traceback (most recent call last):
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/data/yue/.vscode-server/extensions/ms-python.debugpy-2025.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/data/yue/Work_Project/Flux_Deploy/4_api_server.py", line 399, in <module>
    flux_inference = FLUX_INFERENCE(pretrained_model_name_or_path, rmT5=True, transformer_fp8=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/Work_Project/Flux_Deploy/4_api_server.py", line 220, in __init__
    self.control_pipe.load_lora_weights(depth_lora_path, adapter_name="depth") # it must be load first
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/DeepLearning/git_repos/diffusers/src/diffusers/loaders/lora_pipeline.py", line 2000, in load_lora_weights
    has_param_with_expanded_shape = self._maybe_expand_transformer_param_shape_or_error_(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/yue/DeepLearning/git_repos/diffusers/src/diffusers/loaders/lora_pipeline.py", line 2490, in _maybe_expand_transformer_param_shape_or_error_
    expanded_module = torch.nn.Linear(
                      ^^^^^^^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 105, in __init__
    self.weight = Parameter(
                  ^^^^^^^^^^
  File "/data/yue/anaconda3/envs/flux/lib/python3.11/site-packages/torch/nn/parameter.py", line 46, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
```
How can I fix this? Maybe with `pipe.eval()`? I tried that, but the pipeline has no `eval()` method.
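For context, the traceback bottoms out in `torch.nn.Parameter`, which refuses integer dtypes when `requires_grad=True` (the default). A minimal reproduction in plain PyTorch, independent of diffusers, which is presumably why expanding a layer from int8-quantized weights fails:

```python
import torch

# Parameters default to requires_grad=True, which PyTorch only allows for
# floating point and complex dtypes, so an int8 tensor raises.
try:
    torch.nn.Parameter(torch.zeros(4, dtype=torch.int8))
except RuntimeError as e:
    print(e)  # Only Tensors of floating point and complex dtype can require gradients

# With requires_grad=False the same int8 tensor is accepted.
p = torch.nn.Parameter(torch.zeros(4, dtype=torch.int8), requires_grad=False)
print(p.dtype)  # torch.int8
```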
https://github.com/huggingface/diffusers/pull/11655
@sayakpaul this pull request is not merged into the main branch yet?
@sayakpaul when I use the test code to load a LoRA with transformer_fp8, the speed is too slow, and some warnings appear.
Test code:
```python
import torch
from diffusers import FluxControlPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import load_image
from image_gen_aux import DepthPreprocessor

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```
and the warnings:

```
.....
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.36.attn.norm_k.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.37.attn.norm_q.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(
flux/lib/python3.11/site-packages/torch/nn/modules/module.py:2400: UserWarning: for single_transformer_blocks.37.attn.norm_k.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(
 43%|█████████████████████████████████████████████████████                | 13/30 [06:48<08:45, 30.91s/it]
```
The machine is an A800 80G.
When not using a LoRA with transformer_fp8, using the code below:
```python
import torch
from diffusers import FluxControlPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import load_image
from image_gen_aux import DepthPreprocessor

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.float16,
).to("cuda")
# pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```
this pipeline runs much faster than the one with the LoRA loaded:
```
The module 'T5EncoderModel' has been loaded in `bitsandbytes` 8bit and moving it to cuda via `.to()` is not supported. Module is still on cuda:0.
The module 'FluxTransformer2DModel' has been loaded in `bitsandbytes` 8bit and moving it to cuda via `.to()` is not supported. Module is still on cuda:0.
100%|█████████████████████████████████████████████████████████████████████████████████| 30/30 [00:25<00:00,  1.18it/s]
```
Could it be because, when we load the LoRA into an 8-bit base model, there's a single layer that is not quantized? That was a conscious decision we made when implementing this feature for the 4-bit use case as well. Can you try with 4-bit and see what happens? Also, for what it's worth, can we update the description of the PR from fp8 to int8? When we leverage 8-bit in bitsandbytes, it follows the LLM.int8() scheme. It is NOT FP8.
Tagging @matthewdouglas from the bitsandbytes team for the warning being faced.
Btw, the same slow-speed problem happens in the regular LoRA case too:
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("Purz/choose-your-own-adventure")

prompt = "cy04 A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
image = pipeline(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=3.5,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```
So, I am not sure if it's a specific problem or a general problem with 8bit models + LoRA. I am on PyTorch 2.7 and bitsandbytes 0.46.0.
I think I know the cause of the slow-speed problem. Long story short, it's stemming from here:
https://github.com/huggingface/diffusers/blob/0f91f2f6fc697f01ca6da6724e2b3b5600b56a9b/src/diffusers/loaders/peft.py#L416
@SunMarc the above code renders `is_sequential_cpu_offload` to be `True`. Here's a minimal snippet to reproduce:
```python
import torch
from diffusers import AutoModel, BitsAndBytesConfig
from accelerate.hooks import AlignDevicesHook

model = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
is_sequential_cpu_offload = (
    isinstance(model._hf_hook, AlignDevicesHook)
    or hasattr(model._hf_hook, "hooks")
    and isinstance(model._hf_hook.hooks[0], AlignDevicesHook)
)
print(f"{is_sequential_cpu_offload=}")
```
Is this expected?
@sayakpaul yes, you are right, the code makes `is_sequential_cpu_offload` `True`.
The code I tested is:
```python
import torch
from accelerate.hooks import AlignDevicesHook
from diffusers import FluxControlPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import load_image
from image_gen_aux import DepthPreprocessor

pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

# newly added for the test
transformer_is_sequential_cpu_offload = (
    isinstance(pipeline.transformer._hf_hook, AlignDevicesHook)
    or hasattr(pipeline.transformer._hf_hook, "hooks")
    and isinstance(pipeline.transformer._hf_hook.hooks[0], AlignDevicesHook)
)
t5_is_sequential_cpu_offload = (
    isinstance(pipeline.text_encoder_2._hf_hook, AlignDevicesHook)
    or hasattr(pipeline.text_encoder_2._hf_hook, "hooks")
    and isinstance(pipeline.text_encoder_2._hf_hook.hooks[0], AlignDevicesHook)
)
print(f"transformer: {transformer_is_sequential_cpu_offload=} | t5: {t5_is_sequential_cpu_offload=}")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipeline(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```
and the outputs are both True: `transformer: transformer_is_sequential_cpu_offload=True | t5: True`
My question is how to load a LoRA with a quantized model, not only fp8; if int4 is supported, that is also good, but I haven't found any tutorial for it.
About your suggestion, I have two points:
- If loading FluxControlPipeline in int4 together with a LoRA is supported, could you give me some example code?
- Can the fp8 problem be fixed? How? Only by manually setting `is_sequential_cpu_offload=False` on the T5 text encoder and FluxTransformer? And how do I disable sequential CPU offload?

thanks
Hi all,
I can't speak to some of the warnings just yet, but I can offer some thoughts:
- I don't know what `transformers_fp8` is.
- I agree that we should clarify that bitsandbytes 8-bit quantization is an int8 and not an fp8 format. @Johnson-yue I don't think you want to use an fp8 method anyway, as your A800 GPU likely does not offer native support for it.
- When using bitsandbytes int8, you can try setting `llm_int8_threshold` to 0.0 to make it faster. In that setting it is all int8, vs. the default LLM.int8() algorithm, which holds back part of the matmul to be done in fp16. There may be an accuracy drop, but it could be a worthwhile tradeoff if speed is more important.
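The threshold suggestion above would look roughly like this (a sketch: `llm_int8_threshold` is forwarded through `quant_kwargs` to the bitsandbytes config):

```python
import torch
from diffusers.quantizers import PipelineQuantizationConfig

# All-int8 matmuls: llm_int8_threshold=0.0 disables the fp16 outlier path
# of LLM.int8(), trading some accuracy for speed.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_8bit",
    quant_kwargs={"load_in_8bit": True, "llm_int8_threshold": 0.0},
    components_to_quantize=["transformer", "text_encoder_2"],
)
```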
For 4-bit, you can swap out `bitsandbytes_8bit` and `load_in_8bit=True` for `bitsandbytes_4bit` and `load_in_4bit=True`. In this case I would recommend using bfloat16 instead of fp16.
```python
PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
```
@matthewdouglas transformer_fp8 is the code below:

1)

```python
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```
Maybe I did not say it right:
- Even though I said fp8, my code is:

```python
pipeline = FluxControlPipeline.from_pretrained(
    depth_pretrained_model_name_or_path,  # pretrained_model_name_or_path,
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.float16,
).to("cuda")
```

It uses bitsandbytes_8bit, right? And the speed is slow, as @sayakpaul said: when using bitsandbytes_8bit, `is_sequential_cpu_offload` is set to True.
> I think I know the cause of the slow-speed problem. Long story short, it's stemming from here:
> `diffusers/src/diffusers/loaders/peft.py`, line 416 in 0f91f2f: `elif is_sequential_cpu_offload:`
>
> @SunMarc the above code renders `is_sequential_cpu_offload` to be `True`. Here's a minimal snippet to reproduce:
>
> ```python
> import torch
> from diffusers import AutoModel, BitsAndBytesConfig
> from accelerate.hooks import AlignDevicesHook
>
> model = AutoModel.from_pretrained(
>     "black-forest-labs/FLUX.1-dev",
>     subfolder="transformer",
>     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
> )
> is_sequential_cpu_offload = (
>     isinstance(model._hf_hook, AlignDevicesHook)
>     or hasattr(model._hf_hook, "hooks")
>     and isinstance(model._hf_hook.hooks[0], AlignDevicesHook)
> )
> print(f"{is_sequential_cpu_offload=}")
> ```
So the question here is: how to load a LoRA with bitsandbytes_8bit or bitsandbytes_4bit and use the GPU fully, without any offload?
Use 4-bit for now, as it doesn't have that problem. 8-bit models seem to have that hook, so I suspect that is what breaks when a LoRA is loaded. Which is why I asked a question for @SunMarc in https://github.com/huggingface/diffusers/issues/11648#issuecomment-2948073950.
@sayakpaul ok, I will test it later. Thanks, everyone.
Keeping it opened for https://github.com/huggingface/diffusers/issues/11648#issuecomment-2948073950.
I tested two configurations:
1) only pipeline

```python
pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", **pipe_cfgs)
```

2) pipeline + 1 LoRA

```python
pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", **pipe_cfgs)
pipe.load_lora_weights("black-forest-labs/FLUX.1-Canny-dev-lora", adapter_name="Canny")
pipe.set_adapters("Canny", 0.85)
```

and I tested three weight types: torch.float16, bnb_8bit, and bnb_4bit:
```python
# for torch.float16
pipe_cfgs = {}

# for bnb_8bit
pipe_cfgs = {
    "quantization_config": PipelineQuantizationConfig(
        quant_backend="bitsandbytes_8bit",
        quant_kwargs={"load_in_8bit": True},
        components_to_quantize=["transformer"],
    )
}

# for bnb_4bit
pipe_cfgs = {
    "quantization_config": PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer"],
    )
}
```
The results are below. For the "only pipeline" case, the checkpoint with the trained LoRA already fused (FLUX.1-Canny-dev) was used. Note that with 4-bit it was initially 9.51 G of GPU memory and 1 min 14 s; after setting `"bnb_4bit_compute_dtype": torch.bfloat16`, it dropped to 8.95 G and 13 s.
| Configuration | Weight Type | GPU Memory | Time |
|---|---|---|---|
| Only pipeline (fused LoRA) | float16 | 26.61 G | 10 s |
| Only pipeline (fused LoRA) | bnb 8-bit | 14.92 G | 13 s |
| Only pipeline (fused LoRA) | bnb 4-bit | 8.95 G | 13 s |
| Pipeline + 1 LoRA | float16 | 27.34 G | 11 s |
| Pipeline + 1 LoRA | bnb 8-bit | 14.70 G | 3 min 40 s |
| Pipeline + 1 LoRA | bnb 4-bit | 9.68 G | 15 s |
I want to know how to make bnb_8bit use only the GPU (no CPU offload), and is bnb_8bit faster than bnb_4bit?
Traceback (most recent call last): xxxxxxxxx `ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py)`. I got this error; my versions are torch 2.5.1 and diffusers 0.33.1.

Sorry, my fault. `pip install git+https://github.com/huggingface/diffusers` solved it.
> The module 'T5EncoderModel' has been loaded in bitsandbytes 8bit and moving it to cpu via .to() is not supported. Module is still on cuda:0.

How can I save GPU memory? I found that the memory usage is too high. `enable_model_cpu_offload(gpu_id=pipe_gpu_id)` has no effect.
@babyta this issue is about a problem with 8-bit quantization with bitsandbytes; you're asking a general question about usage. Please let's keep issues for reporting problems with the library so we can solve them.
If you want help saving VRAM, we have docs for that, and if you need more help, please open a discussion (not an issue) about it; I'll try to help you, or maybe someone from the community will.
With 8-bit bnb quantization, you can't move the quantized model back to the CPU (for now), so if you use quantization, you must ensure all the models fit in VRAM for inference. If 8-bit doesn't fit on your GPU, you'll have to use 4-bit, and if that doesn't work either, you'll need to do something other than just quantization. What you can do depends on how much RAM you have.
Related issue for 8bit device movement in bitsandbytes can be tracked here: https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1332
> Traceback (most recent call last): xxxxxxxxx ImportError: cannot import name 'PipelineQuantizationConfig' from 'diffusers.quantizers' (/xxxxxxxx/miniconda3/envs/py311pt251/lib/python3.11/site-packages/diffusers/quantizers/__init__.py) Got an error, my version is torch251 diffusers 0.33.1

same error
@lonngxiang git clone diffusers from GitHub and run `pip install -e .`, or `pip install git+https://github.com/huggingface/diffusers`.