Fix textual inversion training with inpainting models
Fixes textual inversion training by adding a dummy mask when using an inpainting model
Working well for me, no more need for annoying switching between inpainting and standard models.
Is there a reason this is not using an existing function for making masks?
No good reason. I just couldn't find an existing method that could be reused easily. There is only StableDiffusionProcessing.txt2img_image_conditioning in processing.py. But that one can't be used without calling sample first. I didn't want to rewrite other parts of the code just to fix TI training for one specific use case.
Do you have a specific solution in mind?
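Roughly, the existing pattern looks like this if you lift it out of processing.py into a standalone helper that TI training could call without a StableDiffusionProcessing instance or a prior sample() call. This is a sketch only; the function name and exact channel layout are assumptions, not the actual PR code:

```python
import torch

# Sketch: a standalone version of the txt2img-style image conditioning.
# The name is made up; it mirrors the pattern in
# StableDiffusionProcessing.txt2img_image_conditioning.
def dummy_image_conditioning(sd_model, x, width, height):
    if sd_model.model.conditioning_key not in {'hybrid', 'concat'}:
        # Non-inpainting models ignore the image conditioning entirely,
        # so a tiny zero tensor is enough as a placeholder.
        return x.new_zeros(x.shape[0], 5, 1, 1)

    # Inpainting models: encode an all-zeros "masked image" to latent space...
    cond_image = torch.zeros(x.shape[0], 3, height, width, device=x.device)
    cond_image = sd_model.get_first_stage_encoding(sd_model.encode_first_stage(cond_image))
    # ...and prepend an all-ones mask channel ("the whole image is masked").
    cond_image = torch.nn.functional.pad(cond_image, (0, 0, 0, 0, 1, 0), value=1.0)
    return cond_image.to(x.dtype)
```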
I just wanted to chime in to say it'd be sweet if TI worked with the inpainting model. It seems the conversation has stalled?
Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?
> Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?
You can still use this version directly, without the newest features, if you just want to train. I'm not going to update the PR anymore, since @AUTOMATIC1111 doesn't seem to be interested in this feature, and I don't want to spend time on something that won't get merged anyway. I even had an implementation for hypernetworks and masked training ¯\_(ツ)_/¯
> Looks like training the inpainting models still doesn't work, and the underlying code has changed enough that this pull request no longer works out of the box. Is there a better fix?

> You can still use this version directly, without the newest features, if you just want to train. I'm not going to update the PR anymore, since @AUTOMATIC1111 doesn't seem to be interested in this feature, and I don't want to spend time on something that won't get merged anyway. I even had an implementation for hypernetworks and masked training ¯\_(ツ)_/¯
Does any other SD training tool have those features? Maybe one of the others could be made to do this.
I added training of inpainting models to StableTuner a few days ago, but that only supports finetuning/dreambooth, no TI or hypernetworks.
I don't get this solution. My original complaint still remains: you can't just copy code; if you need it, rework it so it can be used from multiple places.
The inpainting model requires you to put a copy of the original image into the conditioning. You just create the conditioning once and use it always. So I don't think that can be right.
The dummy mask for the inpainting model is supposed to be made out of the image - you're sending the text conditioning tensor, c, there.
> I don't get this solution. My original complaint still remains: you can't just copy code; if you need it, rework it so it can be used from multiple places.
Then you should have told me that two months ago. I don't have a problem with rewriting code, IF you are OK with that. But usually it makes more sense to keep the impact on other parts of the code to a minimum.
> The inpainting model requires you to put a copy of the original image into the conditioning. You just create the conditioning once and use it always. So I don't think that can be right.
That is not correct. The conditioning image needs to be (original image) * (1 - mask), converted to latent space (see the sketch after the list below). In this implementation I only train on unmasked images. Since embeddings don't change the network, this will not cause the network to forget the capabilities of the conditioning channels. I even tried the same thing for hypernetworks, which also worked. Training with an actual mask would lead to better results, but then you need a whole lot of other tools and functions to support a full training workflow. Just a few examples, not a full list:
- If you want to train on predefined masks: some tool to edit the masks, maybe even automated using CLIP segmentation
- If you just want to train on general images, maybe only a style: some functionality to auto-generate new sensible masks per epoch. This can only be done if the VAE stays in memory during training.
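To make the formula concrete, something like this (a minimal sketch assuming an ldm-style model with encode_first_stage/get_first_stage_encoding; the helper name is made up, and mask = 1 marks the region to inpaint):

```python
import torch
import torch.nn.functional as F

# Sketch: build the inpainting image conditioning from an image and a mask.
# `image` is [B, 3, H, W], `mask` is [B, 1, H, W] with 1 = region to inpaint.
def make_inpainting_conditioning(sd_model, image, mask):
    # Zero out the masked region before encoding: (original image) * (1 - mask).
    conditioning_image = image * (1.0 - mask)
    conditioning_image = sd_model.get_first_stage_encoding(
        sd_model.encode_first_stage(conditioning_image))
    # Bring the mask to latent resolution and prepend it as an extra channel.
    latent_mask = F.interpolate(mask, size=conditioning_image.shape[-2:])
    return torch.cat([latent_mask, conditioning_image], dim=1)
```

With a constant dummy mask this is the same formula; in particular, an all-ones mask degenerates to the txt2img-style conditioning (zero image, full mask), which is why no per-image mask tooling is needed for the fix itself.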
> The dummy mask for the inpainting model is supposed to be made out of the image - you're sending the text conditioning tensor, c, there.
Again, this is also not correct. I used the same implementation that is used elsewhere, where I take the batch size (x.shape[0]) and the data type and device (x.dtype and x.device) from some other input to the U-Net. I'm not "creating the conditioning image out of the text conditioning".
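In other words, something like this (minimal sketch, names made up): only the batch size, dtype and device are borrowed from the donor tensor; its values never enter the image conditioning.

```python
import torch

# Sketch: only shape[0], dtype and device of `other_input` are read.
def build_dummy_image_conditioning(other_input, latent_height, latent_width):
    # 5 channels = 1 mask channel + 4 latent channels.
    return torch.zeros(other_input.shape[0], 5, latent_height, latent_width,
                       dtype=other_input.dtype, device=other_input.device)

# Example with the text conditioning c as the donor tensor:
c = torch.randn(2, 77, 768)                        # stand-in text conditioning
img_cond = build_dummy_image_conditioning(c, 64, 64)
# ldm hybrid-conditioning layout: text via cross-attention, image via concat.
cond = {"c_crossattn": [c], "c_concat": [img_cond]}
```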
The message about using an existing function is right there.