
Confusion when training DreamBooth via the train_dreambooth_lora_sdxl_advanced script

Open joey0922 opened this issue 1 year ago • 19 comments

I used the train_dreambooth_lora_sdxl_advanced script downloaded from diffusers' official examples to train on my own images; however, both the validation images and the images generated by the trained model are of bad quality. They are always very indistinct. I used the script offered by the diffusers examples and didn't change any parameters. Why did this happen? How can I fix it?

joey0922 avatar Apr 25 '24 11:04 joey0922

Have you read this blog post: LoRA training scripts of the world, unite! It might be helpful.

tolgacangoz avatar Apr 25 '24 11:04 tolgacangoz

> Have you read this blog post: LoRA training scripts of the world, unite! It might be helpful.

Yeah, I have read it and tried it, but the generated images are still blurry, even broken. They look unfinished. Maybe it is because of my VAE? I have no idea how to fix it.

joey0922 avatar Apr 26 '24 02:04 joey0922

The first thing that comes to my mind is the dataset. Are you using your own dataset?

asomoza avatar Apr 26 '24 05:04 asomoza

> The first thing that comes to my mind is the dataset. Are you using your own dataset?

Yeah, I used my own dataset, and I think you are right. I found the quality of the generated images was better when I trained the model with more images. May I have your contact please, so I can ask you for advice when I have problems? Thanks!

joey0922 avatar Apr 26 '24 07:04 joey0922

> The first thing that comes to my mind is the dataset. Are you using your own dataset?

> Yeah, I used my own dataset, and I think you are right. I found the quality of the generated images was better when I trained the model with more images. May I have your contact please, so I can ask you for advice when I have problems? Thanks!

@asomoza Are there any requirements for training data in terms of volume and quality? I used about 30 images to train the model, and it turned out better but still blurry. Is it because my dataset is not good enough? The images I used were downloaded from the internet, though they have different perspectives and various backgrounds.

I found a parameter named cross_attention_kwargs at inference time; when I set its "scale" key lower, the generated image is of higher quality, but the subject is less similar. Do you have any idea why? Increasing the volume of training data still can't fix the problem completely.

joey0922 avatar Apr 26 '24 11:04 joey0922

> May I have your contact please, so I can ask you for advice when I have problems? Thanks!

I'd rather answer here so other people can also learn, and as a reminder to myself too.

> Are there any requirements for training data in terms of volume and quality?

For DreamBooth there isn't a required number of images; it can be as few as 10 or less. It all depends on what you want to train and the results you want.

For example, to train a style you'll need a lot more images so the model learns the style and doesn't associate anything else from the images with it. If you just want it to learn a close-up of a face, even one image can be enough, but if you want it to learn a person in various poses and distances, you'll need at least a couple of images of each.

About quality: since you're training for SDXL, the images need to be 1024x1024 pixels or higher; until you've learned to train and have a couple of good trainings behind you, don't go lower. The same goes for aspect ratio and data augmentation: until you learn how to train, just use square images and crop them yourself. Make sure each image contains what you want and matches the caption you're providing.

You can use blurry or lower-quality images if you caption them as such, but in my experience it's better to just remove them from the dataset.
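
The manual prep described above (square crop, 1024x1024 or larger) can be sketched with Pillow. `prepare_training_image` is a hypothetical helper, not part of the training script, and it assumes the subject is roughly centered, so the result should still be inspected by eye:

```python
from PIL import Image

def prepare_training_image(img, size=1024):
    """Center-crop to a square, then resize to size x size.

    Minimal sketch of the manual dataset prep suggested above;
    it assumes the subject is roughly centered in the frame.
    """
    img = img.convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.LANCZOS)
```

For instance, a 1600x1200 download becomes a centered 1024x1024 square; an image whose short side is below 1024 would get upscaled, which is usually worth avoiding per the advice above.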

> I found a parameter named cross_attention_kwargs at inference time; when I set its "scale" key lower, the generated image is of higher quality, but the subject is less similar.

You can learn about this in the docs: https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#merge-adapters.

But essentially what you're doing is lowering the effect of the LoRA and letting the base model affect more of the image.

It's not a requirement, but I like LoRAs to be good at 1.0 scale; that means your LoRA should work as expected without needing to make the scale higher or lower than 1.0.

In my opinion, if you have to lower the scale of the LoRA it's overtrained, but a lot of people just lower the scale instead of retraining.
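
Conceptually, the "scale" acts as a blend factor: the LoRA's contribution is added on top of the base model's output, multiplied by the scale. A toy sketch, with plain lists standing in for tensors (`apply_lora` is illustrative, not the diffusers implementation):

```python
def apply_lora(base_out, lora_delta, scale=1.0):
    # output = base + scale * lora_delta; at scale 0.0 the LoRA is
    # effectively off, at 1.0 it contributes fully.
    return [b + scale * d for b, d in zip(base_out, lora_delta)]

full = apply_lora([1.0, 2.0], [0.5, 0.5], scale=1.0)  # [1.5, 2.5]
half = apply_lora([1.0, 2.0], [0.5, 0.5], scale=0.5)  # [1.25, 2.25]
```

At inference in diffusers, this is the knob exposed through `cross_attention_kwargs={"scale": ...}` mentioned above.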

asomoza avatar Apr 26 '24 13:04 asomoza


Thanks for your answer. I will give it a try. I intend to generate a specific car, one the base model hasn't seen before, together with a background. I consulted with you before and followed your outpainting guide, but I found the car can't blend into the background naturally via inpainting. I have tried to train the DreamBooth LoRA SDXL version; the image quality was good, but the generated car was never very similar to the original car. I don't know if that's because I refined the generated image to improve its quality. So I want to try the script in advanced diffusion training, which is also able to train a multi-concept LoRA model. Besides the dataset, what else might fix this? How can I adjust the hyperparameters to find the sweet spot?

joey0922 avatar Apr 26 '24 16:04 joey0922

For the best common practices you can refer to the documentation and the blog post that @standardAI linked before.

Sadly, there's no easy and surefire way to find the sweet spot; there are too many variables to take into account in training and generation.

All the good LoRA trainers I know trained hundreds of LoRAs before getting as good as they are now; it's just something you learn by doing it a lot of times.

Having said that, if you're doing this for professional work, I don't think we're yet in a position to delegate everything to the AI. You'll need a combination of AI and (a lot of) manual work to get good results if you need the original car to blend perfectly into the background without being distorted.

Personally, I haven't seen anyone working with real cars. I think they're among the things AI is bad at, like hands and logos, because they have so many little details that can go wrong; you should be okay with simple generic cars, though.

When I have the time I'll test a LoRA with car images (with a Creative Commons license), but it's not going to be soon.

As a last resort maybe you can write to some of the https://civitai.com/tag/car creators if you find a good LoRA that works the way you need it.

asomoza avatar Apr 26 '24 18:04 asomoza


Got it, thanks for your help. I will keep trying.

joey0922 avatar Apr 26 '24 23:04 joey0922

@asomoza Hi, I used high-quality images to train the model and it turned out much better. Some tiny details aren't perfect, but it's acceptable. I have another question: I want to use this script to train multiple concepts (or subjects) into one model, in my case different types of cars, but I didn't find any instructions. Do you know how to do this? Many thanks!

joey0922 avatar Apr 29 '24 08:04 joey0922

@asomoza Sorry to bother you; I figured it out myself. But I found that when I load the LoRA model with the base model, it can rarely change the style of the generated image (painting, cartoon style, and so on); it generates a photo nearly 100% of the time. Why does this happen? What should I do to keep the base model's ability to change styles? Thanks.

joey0922 avatar Apr 29 '24 09:04 joey0922

Hi, I'm glad you're progressing towards your goal.

For what you want, you have multiple options. One is to just lower the scale of the LoRA until you see the effect of the base model; it also helps if you suggest the style in the prompt, but you can lose some of the original detail of the LoRA.

Also, when I train LoRAs I save multiple epochs and test them with styles in the prompt and with different models. Usually I choose one that I can use at 1.0, that can change styles, and that gives the results I want, but that's the ideal LoRA training and can be hard to achieve at first when you're learning.

Another method is to use IP Adapters with InstantStyle to force the model to apply the style, you can read about how to do it here: https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter#style--layout-control

Finally, you can train your LoRA with that in mind: in the captions or tags, add the media format (photo, cartoon, anime, drawing), but you'll also need to provide images in these styles for the model to learn from. If you're using common styles, you'll just need a couple of them so the model learns that the style can change.
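
As a concrete illustration of that last point, the captions could pair the subject token with explicit media styles. The `sks` token and filenames here are hypothetical, not from this thread:

```python
# Hypothetical caption set: most images are photos, but a couple of
# stylized examples teach the model that the medium can vary.
captions = {
    "car_01.jpg": "a photo of a sks car on a highway at dusk",
    "car_02.jpg": "a photo of a sks car, front view, studio lighting",
    "car_03.jpg": "a cartoon drawing of a sks car",
    "car_04.jpg": "an anime style illustration of a sks car",
}
```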

asomoza avatar Apr 29 '24 17:04 asomoza

Thanks for your reply. I tried the IP Adapter method and got a TypeError; it said unsupported operand type(s) for *: 'dict' and 'Tensor'. How do I fix it? I know it is because of the scale type, but I have no idea how to fix it.

joey0922 avatar May 06 '24 04:05 joey0922

I'll need a reproducible code snippet and the error log to see where the error is happening to be able to help you with that.

asomoza avatar May 06 '24 15:05 asomoza

> I'll need a reproducible code snippet and the error log to see where the error is happening to be able to help you with that.

Of course, my code is below:

```python
import torch
from diffusers.utils import load_image
from diffusers import DiffusionPipeline

model_name_or_path = "stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(
    model_name_or_path, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
ip_adapter = "ip-adapter"
pipe.load_ip_adapter(ip_adapter, subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipe.set_ip_adapter_scale(scale)

style_image = load_image("style.png")

prompt = "a car in street of a modern city at night along with towers and buildings"
negative_prompt = ""
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    num_images_per_prompt=1,
    ip_adapter_image=style_image,
    guidance_scale=5,
    width=1152,
    height=720,
).images[0]
```

and the error message is as follows:

```
TypeError                                 Traceback (most recent call last)
/home/joey/workspace/AIGC/finetuning/inference.ipynb cell 31, line 3
      1 prompt = "a car in street of a modern city at night along with towers and buildings"
      2 negative_prompt = ""
----> 3 image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=25, num_images_per_prompt=1,
      4     ip_adapter_image=style_image, guidance_scale=5, width=1152, height=720).images[0]

[... intermediate frames through StableDiffusionXLPipeline.__call__ (pipeline_stable_diffusion_xl.py:1176),
 UNet2DConditionModel.forward (unet_2d_condition.py:1216), CrossAttnDownBlock2D.forward (unet_2d_blocks.py:1279),
 Transformer2DModel.forward (transformer_2d.py:397), BasicTransformerBlock.forward (attention.py:366),
 Attention.forward (attention_processor.py:522), and torch/nn/modules/module.py elided ...]

File ~/anaconda3/envs/aigc/lib/python3.10/site-packages/diffusers/models/attention_processor.py:2417, in IPAdapterAttnProcessor2_0.__call__(self, attn, hidden_states, encoder_hidden_states, attention_mask, temb, scale, ip_adapter_masks)
   2415     current_ip_hidden_states = current_ip_hidden_states * mask_downsample
-> 2417     hidden_states = hidden_states + scale * current_ip_hidden_states

TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'
```

I downloaded all the pretrained models manually, so I load them locally. When the scale is a float it works normally; it raises this error when I pass a scale of type dict.

joey0922 avatar May 07 '24 07:05 joey0922

I don't see anything wrong with your code, but use the ViT-H IP Adapter instead; I really don't use the one you're using, it's mostly a waste of bandwidth.

pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",
)

You're using a dict to set the scales of the IP Adapter, and that's something new, so make sure you're using diffusers from source. That's probably why you're getting the error.
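
The traceback is consistent with that: an attention processor that predates per-block scales multiplies the raw dict by a tensor, which Python rejects. A dependency-free sketch of the failure (the `Tensor` class here is just a stand-in so the error message matches the real one):

```python
class Tensor:  # stand-in for torch.Tensor, only to reproduce the message
    pass

scale = {"up": {"block_0": [0.0, 1.0, 0.0]}}  # the dict passed to set_ip_adapter_scale
try:
    _ = scale * Tensor()  # what an older IPAdapterAttnProcessor effectively does
except TypeError as err:
    print(err)  # unsupported operand type(s) for *: 'dict' and 'Tensor'
```

A diffusers version that understands per-block scales unpacks the dict into per-layer floats before it ever reaches this multiplication, which is why installing from source fixes it.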

asomoza avatar May 08 '24 08:05 asomoza

Thank you for your answer. The odd thing is that I had already installed diffusers from source (version 0.28.0.dev0) but it still errored. However, after I created a new environment and reinstalled diffusers from source, it worked. Really appreciate it! Here is another question: how can I speed up inference? I trained a DreamBooth LoRA model with textual inversion, saved as safetensors. It takes about 6.5 s to generate one image on a 3090, and more than 22 s if I set num_images_per_prompt to 4. Is there any way to optimize inference speed?

joey0922 avatar May 08 '24 10:05 joey0922

Yes, I want to write a guide about that, but right now you can do what's in this discussion: https://github.com/huggingface/diffusers/discussions/6609. Looking at your inference time, I think the only one you're not doing is torch.compile.

So after that, I would suggest you look into this:

  • Use Lightning or Hyper-SD.
  • Align Your Steps
  • TGate with DeepCache (https://huggingface.co/docs/diffusers/main/en/optimization/tgate?pipelines=StableDiffusionXL+with+DeepCache)
  • Disable CFG after some steps (https://huggingface.co/docs/diffusers/main/en/using-diffusers/callback#dynamic-classifier-free-guidance)

If you use them right, I think you can lower your inference time a lot.
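
For the last bullet, the linked callback pattern looks roughly like this. The 40% cutoff and the `_Pipe` stand-in are illustrative; the attributes touched (`num_timesteps`, `_guidance_scale`, and `prompt_embeds` in `callback_kwargs`) follow the diffusers callback docs:

```python
def make_cfg_cutoff_callback(cutoff=0.4):
    """Build a callback_on_step_end that disables classifier-free guidance
    after `cutoff` of the steps, keeping only the conditional embeddings."""
    def callback(pipe, step_index, timestep, callback_kwargs):
        if step_index == int(pipe.num_timesteps * cutoff):
            # Drop the unconditional half and zero the guidance scale.
            callback_kwargs["prompt_embeds"] = callback_kwargs["prompt_embeds"][-1:]
            pipe._guidance_scale = 0.0
        return callback_kwargs
    return callback

# Stand-in pipeline object to show the effect without loading a model.
class _Pipe:
    num_timesteps = 25
    _guidance_scale = 5.0

pipe = _Pipe()
cb = make_cfg_cutoff_callback(0.4)
kwargs = {"prompt_embeds": ["uncond", "cond"]}  # lists standing in for tensors
for step in range(pipe.num_timesteps):
    kwargs = cb(pipe, step, None, kwargs)
```

With a real pipeline you'd pass it as `pipe(..., callback_on_step_end=cb, callback_on_step_end_tensor_inputs=["prompt_embeds"])`.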

asomoza avatar May 08 '24 20:05 asomoza

Thank you. I have tried some of them. If I want to use the first method you mentioned, I have to fine-tune on Lightning or Hyper-SD instead of the SDXL base model in the first place, right? Both Align Your Steps and TGate with DeepCache make inference much faster, but they really damage the quality of the generated images; I have to figure out how to reduce that quality loss if I want to use them. Disabling CFG works well for me. I will keep working on making inference faster. Thanks again.

joey0922 avatar May 10 '24 02:05 joey0922