Fix textual inversion training with inpainting models
Fixes textual inversion training by adding a dummy mask when using an inpainting model
Working well for me, no more need for annoying switching between inpainting and standard models.
Is there a reason this is not using an existing function for making masks?
No good reason. I just couldn't find an existing method that could be reused easily. There is only StableDiffusionProcessing.txt2img_image_conditioning in processing.py. But that one can't be used without calling sample first. I didn't want to rewrite other parts of the code just to fix TI training for one specific use case.
Do you have a specific solution in mind?
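Roughly, the existing pattern looks like this if you lift it out of processing.py into a standalone helper that TI training could call without a StableDiffusionProcessing instance or a prior sample() call. This is a sketch only; the function name and exact channel layout are assumptions, not the actual PR code:

```python
import torch

# Sketch: a standalone version of the txt2img-style image conditioning.
# The name is made up; it mirrors the pattern in
# StableDiffusionProcessing.txt2img_image_conditioning.
def dummy_image_conditioning(sd_model, x, width, height):
    if sd_model.model.conditioning_key not in {'hybrid', 'concat'}:
        # Non-inpainting models ignore the image conditioning entirely,
        # so a tiny zero tensor is enough as a placeholder.
        return x.new_zeros(x.shape[0], 5, 1, 1)

    # Inpainting models: encode an all-zeros "masked image" to latent space...
    cond_image = torch.zeros(x.shape[0], 3, height, width, device=x.device)
    cond_image = sd_model.get_first_stage_encoding(sd_model.encode_first_stage(cond_image))
    # ...and prepend an all-ones mask channel ("the whole image is masked").
    cond_image = torch.nn.functional.pad(cond_image, (0, 0, 0, 0, 1, 0), value=1.0)
    return cond_image.to(x.dtype)
```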
I just wanted to chime in to say it'd be sweet if TI worked with the inpainting model. It seems the conversation has stalled?
Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?
> Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?
You can still use this version directly, without the newest features, if you just want to train. I'm not going to update the PR anymore, since @AUTOMATIC1111 doesn't seem to be interested in this feature, and I don't want to spend time on something that won't get merged anyway. I even had an implementation for hypernetworks and masked training ¯\_(ツ)_/¯
> Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?

> You can still use this version directly, without the newest features, if you just want to train. I'm not going to update the PR anymore, since @AUTOMATIC1111 doesn't seem to be interested in this feature, and I don't want to spend time on something that won't get merged anyway. I even had an implementation for hypernetworks and masked training ¯\_(ツ)_/¯
Does any other SD training tool have those features? Maybe one of the others could be made to do this.
I added training of inpainting models to StableTuner a few days ago, but that only supports finetuning/dreambooth, no TI or hypernetworks.
I don't get this solution. My original complaint still remains: you can't just copy code; if you need it, rework it so it can be used from multiple places.
The inpainting model requires you to put a copy of the original image into the conditioning. You just create the conditioning once and use it always. So I don't think that can be right.
The dummy mask for the inpainting model is supposed to be made out of the image - you're sending the text conditioning tensor, c, there.
> I don't get this solution. My original complaint still remains: you can't just copy code; if you need it, rework it so it can be used from multiple places.
Then you should have told me that two months ago. I don't have a problem with rewriting code, IF you are OK with that. But usually it makes more sense to keep the impact on other parts of the code to a minimum.
> The inpainting model requires you to put a copy of the original image into the conditioning. You just create the conditioning once and use it always. So I don't think that can be right.
That is not correct. The conditioning image needs to be (original image) * (1 - mask), converted to latent space (see the sketch after the list below). In this implementation I only train on unmasked images. Since embeddings don't change the network, this will not cause the network to forget the capabilities of the conditioning channels. I even tried the same thing for hypernetworks, which also worked. Training with an actual mask would lead to better results, but then you need a whole lot of other tools and functions to support a full training workflow. Just a few examples, not a full list:
- If you want to train on predefined masks: some tool to edit the masks, maybe even automated using CLIP segmentation
- If you just want to train on general images, maybe only a style: some functionality to auto-generate new sensible masks per epoch. This can only be done if the VAE stays in memory during training.
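To make the formula concrete, something like this (a minimal sketch assuming an ldm-style model with encode_first_stage/get_first_stage_encoding; the helper name is made up, and mask = 1 marks the region to inpaint):

```python
import torch
import torch.nn.functional as F

# Sketch: build the inpainting image conditioning from an image and a mask.
# `image` is [B, 3, H, W], `mask` is [B, 1, H, W] with 1 = region to inpaint.
def make_inpainting_conditioning(sd_model, image, mask):
    # Zero out the masked region before encoding: (original image) * (1 - mask).
    conditioning_image = image * (1.0 - mask)
    conditioning_image = sd_model.get_first_stage_encoding(
        sd_model.encode_first_stage(conditioning_image))
    # Bring the mask to latent resolution and prepend it as an extra channel.
    latent_mask = F.interpolate(mask, size=conditioning_image.shape[-2:])
    return torch.cat([latent_mask, conditioning_image], dim=1)
```

With a constant dummy mask this is the same formula; in particular, an all-ones mask degenerates to the txt2img-style conditioning (zero image, full mask), which is why no per-image mask tooling is needed for the fix itself.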
> The dummy mask for the inpainting model is supposed to be made out of the image - you're sending the text conditioning tensor, c, there.
Again, this is also not correct. I used the same implementation that is used elsewhere, where I take the batch size (x.shape[0]) and the data type and device (x.dtype and x.device) from some other input to the U-Net. I'm not "creating the conditioning image out of the text conditioning".
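In other words, something like this (minimal sketch, names made up): only the batch size, dtype and device are borrowed from the donor tensor; its values never enter the image conditioning.

```python
import torch

# Sketch: only shape[0], dtype and device of `other_input` are read.
def build_dummy_image_conditioning(other_input, latent_height, latent_width):
    # 5 channels = 1 mask channel + 4 latent channels.
    return torch.zeros(other_input.shape[0], 5, latent_height, latent_width,
                       dtype=other_input.dtype, device=other_input.device)

# Example with the text conditioning c as the donor tensor:
c = torch.randn(2, 77, 768)                        # stand-in text conditioning
img_cond = build_dummy_image_conditioning(c, 64, 64)
# ldm hybrid-conditioning layout: text via cross-attention, image via concat.
cond = {"c_crossattn": [c], "c_concat": [img_cond]}
```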
The message about using an existing function is right there.