diffusers confusion when train dreambooth via script train_dreambooth_lora_sdxl

I used the train_dreambooth_lora_sdxl_advanced script downloaded from diffusers' official exmaples to train my own images, however, both the validation images generated and the images generated by trained model are of bad quality. They are always very indistinct. I used the script offered by diffsuers examples and didn't change any parameters. Why did this happen? How can I fix it?

Apr 25 '24 11:04 joey0922

Have you read this blog post: LoRA training scripts of the world, unite! It might be helpful.

Apr 25 '24 11:04 tolgacangoz

Have you read this blog post: LoRA training scripts of the world, unite! It might be helpful.

Yeah, I have read and try it, but the generated images are still blurry, even broken. They seem not finished generating. Maybe it is because of my vae? I have no idea how to fix it.

Apr 26 '24 02:04 joey0922

the first thing that comes to my mind it's the dataset, are you using your own dataset?

Apr 26 '24 05:04 asomoza

the first thing that comes to my mind it's the dataset, are you using your own dataset?

yeah, I used my own dataset. I think you are right. I found the quality of generated images are better when I trained model with more images. May I have your contact pls? So I can ask you for advice when I have problems. Thanks!

Apr 26 '24 07:04 joey0922

the first thing that comes to my mind it's the dataset, are you using your own dataset?

yeah, I used my own dataset. I think you are right. I found the quality of generated images are better when I trained model with more images. May I have your contact pls? So I can ask you for advice when I have problems. Thanks!

@asomoza Are there any requiements for training data in terms of volume and quality？I used about 30 images to train the model, and it turned out better but still blurry. It is because my dataset is not good enough? The images I used are downloaded from internet. Though they have different perspectives and various backgrounds.

I found a parameter named cross_attention_kwargs in inference, when a set "scale" key of it lower the generated image will be of higher quality, meanwhile the subject is less similar. Do you have any idea? Increasing the volume of train data still can't fix the problem completely.

Apr 26 '24 11:04 joey0922

May I have your contact pls? So I can ask you for advice when I have problems. Thanks!

I'll rather answer here so other people can also learn and as a reminder to myself too.

Are there any requiements for training data in terms of volume and quality？

For dreambooth there isn't a requirement of number of images, it can be as little as 10 or less, it all depends on what you want to train and the results you want.

For example to train a style you'll need a lot more images so the model learn it and doesn't associate anything from the images as the style. If you just want it to learn a closeup of a face, it can be even one image but if you want it to learn a person in various poses and distances you'll need at least a couple of images of each one.

About the quality, since you're training for SDXL, you'll need the images to be of 1024x1024 pixels or higher, until you learn to train and have a couple of good trainings don't go lower. Same for aspect ratio and data augmentation, until you learn how to train, just use square images and crop them yourself, make sure that each image has what you want and what's in the caption you're providing.

You can use blurry or lower quality images if you caption them as so, but in my experience it's better to just remove them from the dataset.

I found a parameter named cross_attention_kwargs in inference, when a set "scale" key of it lower the generated image will be of higher quality, meanwhile the subject is less similar.

You can learn about this in the docs: https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#merge-adapters.

But essentially what you're doing is lowering the effect of the LoRA and letting the base model affect more of the image.

Is not a requirement but I like the LoRAs to be good at 1.0 scale, that means that your LoRA should work as expected without the need of making the scale higher or lower than 1.0

In my opinion, if you have to lower the scale of the LoRA it's over trained, but there's a lot of people that just lower the scale instead of retraining them.

Apr 26 '24 13:04 asomoza

May I have your contact pls? So I can ask you for advice when I have problems. Thanks!

I'll rather answer here so other people can also learn and as a reminder to myself too.

Are there any requiements for training data in terms of volume and quality？

For dreambooth there isn't a requirement of number of images, it can be as little as 10 or less, it all depends on what you want to train and the results you want.

For example to train a style you'll need a lot more images so the model learn it and doesn't associate anything from the images as the style. If you just want it to learn a closeup of a face, it can be even one image but if you want it to learn a person in various poses and distances you'll need at least a couple of images of each one.

About the quality, since you're training for SDXL, you'll need the images to be of 1024x1024 pixels or higher, until you learn to train and have a couple of good trainings don't go lower. Same for aspect ratio and data augmentation, until you learn how to train, just use square images and crop them yourself, make sure that each image has what you want and what's in the caption you're providing.

You can use blurry or lower quality images if you caption them as so, but in my experience it's better to just remove them from the dataset.

I found a parameter named cross_attention_kwargs in inference, when a set "scale" key of it lower the generated image will be of higher quality, meanwhile the subject is less similar.

You can learn about this in the docs: https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#merge-adapters.

But essentially what you're doing is lowering the effect of the LoRA and letting the base model affect more of the image.

Is not a requirement but I like the LoRAs to be good at 1.0 scale, that means that your LoRA should work as expected without the need of making the scale higher or lower than 1.0

In my opinion, if you have to lower the scale of the LoRA it's over trained, but there's a lot of people that just lower the scale instead of retraining them.

Thanks for your answer. I will give it a try. I intend to generate a specified car that base model didn’t see before with background. I have consult with you and followed your outpainting guide to do it. But I found the car can’t blend in the background naturally by in painting. I have tried to train Dreambooth Lora SDXL version, the quality of image was good but the generated car was always not so similar to the original car. I have no idea if it was because I refine the generated image to improve the image quality. So I want to try the script in advanced diffusion training. It is also able to train a multiple concept Lora model. Besides the dataset, what kinds of things I may be able to fix this? How can I adjust the hyper parameters to find the sweet spot?

Apr 26 '24 16:04 joey0922

For the best common practices you can refer to the documentation and the blog post that @standardAI linked before.

Sadly there's not an easy and surefire way to find the sweet spot, to many variables to take into account for the training and the generation.

All the good LoRA trainers I know, have trained hundreds of LoRAs before getting as good as they are now, it's just something that you learn by doing it a lot of times.

Having said that, if you're doing this for a professional kind of work, I don't think we're yet in a position to delegate everything to the AI , you'll need a combination of AI and manual work (a lot) to get a good results if you need the original car to blend perfectly in the background without distorting it.

Personally I haven't seen someone working with real cars, I think they're part of the things AI is bad at like hands and logos because they have so many little details that can go bad, you should be okay with simple generic cars though.

When I have the time I'll test a LoRA with car images (with creative common license) but it's not going to be soon.

As a last resort maybe you can write to some of the https://civitai.com/tag/car creators if you find a good LoRA that works the way you need it.

Apr 26 '24 18:04 asomoza

For the best common practices you can refer to the documentation and the blog post that @standardAI linked before.

Sadly there's not an easy and surefire way to find the sweet spot, to many variables to take into account for the training and the generation.

All the good LoRA trainers I know, have trained hundreds of LoRAs before getting as good as they are now, it's just something that you learn by doing it a lot of times.

Having said that, if you're doing this for a professional kind of work, I don't think we're yet in a position to delegate everything to the AI , you'll need a combination of AI and manual work (a lot) to get a good results if you need the original car to blend perfectly in the background without distorting it.

Personally I haven't seen someone working with real cars, I think they're part of the things AI is bad at like hands and logos because they have so many little details that can go bad, you should be okay with simple generic cars though.

When I have the time I'll test a LoRA with car images (with creative common license) but it's not going to be soon.

As a last resort maybe you can write to some of the https://civitai.com/tag/car creators if you find a good LoRA that works the way you need it.

Got it, thanks for your help. I will keep trying.

Apr 26 '24 23:04 joey0922

@asomoza Hi, I used high quality images to train the model and it turned out much better. Though some tiny details are not perfect but it is acceptable. I have another question here, I want to use this script to train multiple concepts(or subjects) into one model, in my case, the concepts(or subjects) are different types of cars, but I didn't find any instruction. Do you know how to do this? Many thanks!

Apr 29 '24 08:04 joey0922

@asomoza Hi, I used high quality images to train the model and it turned out much better. Though some tiny details are not perfect but it is acceptable. I have another question here, I want to use this script to train multiple concepts(or subjects) into one model, in my case, the concepts(or subjects) are different types of cars, but I didn't find any instruction. Do you know how to do this? Many thanks!

@asomoza sorry to bothering. I figured it out myself. But I found when a load lora model with base model, it seldomly can change the style of generated image, like painting, cartoon style and so on. It keeps generate a photo nearly 100%. Why did this happen? What should I do to keep the ability of base model's style changing? Thanks.

Apr 29 '24 09:04 joey0922

Hi, I'm glad you're progressing towards your goal.

For what you want, you have multiple options, one is to just lower the scale of the lora until you see the effect of the base model, also it helps if you suggest it in the prompt, but you can lose some of the original detail of the lora.

Also, when I train LoRAs I save multiple epochs and test them with styles in the prompt and different models, usually I choose one where I can use it 1.0, can change styles and gives the results I want, but that's the ideal LoRA training and can be hard to achieve at first when you're learning.

Another method is to use IP Adapters with InstantStyle to force the model to apply the style, you can read about how to do it here: https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter#style--layout-control

Finally you can train your LoRA with that, in the captions or tags, add the media format like photo, cartoon, anime, drawing for it, but you'll need to provide also images with this styles for the model to learn. If you're using common styles, you'll just need a couple of them so the model just learns that it can change it.

Apr 29 '24 17:04 asomoza

Hi, I'm glad you're progressing towards your goal.

For what you want, you have multiple options, one is to just lower the scale of the lora until you see the effect of the base model, also it helps if you suggest it in the prompt, but you can lose some of the original detail of the lora.

Also, when I train LoRAs I save multiple epochs and test them with styles in the prompt and different models, usually I choose one where I can use it 1.0, can change styles and gives the results I want, but that's the ideal LoRA training and can be hard to achieve at first when you're learning.

Another method is to use IP Adapters with InstantStyle to force the model to apply the style, you can read about how to do it here: https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter#style--layout-control

Finally you can train your LoRA with that, in the captions or tags, add the media format like photo, cartoon, anime, drawing for it, but you'll need to provide also images with this styles for the model to learn. If you're using common styles, you'll just need a couple of them so the model just learns that it can change it.

Thanks for you reply. I tried the ip adapter method and found a typeerror, it said unsupported operand type(s) for *: 'dict' and 'Tensor'. How to fix it? I know it is because of the scale type, but I have no idea how to fix it.

May 06 '24 04:05 joey0922

I'll need a reproducible code snippet and the error log to see where the error is happening to be able to help you with that.

May 06 '24 15:05 asomoza

I'll need a reproducible code snippet and the error log to see where the error is happening to be able to help you with that.

of course, my code is blow: `import torch from diffusers.utils import load_image from diffusers import DiffusionPipeline

model_name_or_path = "stable-diffusion-xl-base-1.0" pipe = DiffusionPipeline.from_pretrained( model_name_or_path, torch_dtype=torch.float16, variant="fp16" ).to("cuda") ip_adapter = "ip-adapter" pipe.load_ip_adapter(ip_adapter, subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

scale = { "up": {"block_0": [0.0, 1.0, 0.0]}, } pipe.set_ip_adapter_scale(scale)

style_image = load_image("style.png")

prompt = "a car in street of a modern city at night along with towers and buildings" negative_prompt = "" image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=25, num_images_per_prompt=1, ip_adapter_image=style_image, guidance_scale=5, width=1152, height=720).images[0]`

and the error message is as following: TypeError Traceback (most recent call last) /home/joey/workspace/AIGC/finetuning/inference.ipynb 单元格 31 line 3 1 prompt = "a car in street of a modern city at night along with towers and buildings" 2 negative_prompt = "" ----> 3 image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=25, num_images_per_prompt=1, 4 ip_adapter_image=style_image, guidance_scale=5, width=1152, height=720).images[0]

File ~/anaconda3/envs/aigc/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs) 112 @functools.wraps(func) 113 def decorate_context(*args, **kwargs): 114 with ctx_factory(): --> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/aigc/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py:1176, in StableDiffusionXLPipeline.call(self, prompt, prompt_2, height, width, num_inference_steps, timesteps, denoising_end, guidance_scale, negative_prompt, negative_prompt_2, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds, ip_adapter_image, ip_adapter_image_embeds, output_type, return_dict, cross_attention_kwargs, guidance_rescale, original_size, crops_coords_top_left, target_size, negative_original_size, negative_crops_coords_top_left, negative_target_size, clip_skip, callback_on_step_end, callback_on_step_end_tensor_inputs, **kwargs) 1174 if ip_adapter_image is not None or ip_adapter_image_embeds is not None: 1175 added_cond_kwargs["image_embeds"] = image_embeds -> 1176 noise_pred = self.unet( 1177 latent_model_input, 1178 t, 1179 encoder_hidden_states=prompt_embeds, 1180 timestep_cond=timestep_cond, 1181 cross_attention_kwargs=self.cross_attention_kwargs, 1182 added_cond_kwargs=added_cond_kwargs, 1183 return_dict=False, 1184 )[0] 1186 # perform guidance 1187 if self.do_classifier_free_guidance: