[Feat]: Improve "rotation" augmentation
Describe your use-case.
Currently, when using the "rotation" image augmentation, the corners of the resulting image are filled with black triangles. For inpainting model training this may not be a problem, but otherwise these black corners tend to bleed strongly into the model's output. I've done some experimentation and have a few possible methods to reduce this issue, at least for rotations by relatively small angles. This would help a lot with increasing image variation for limited datasets. Below are some image examples and some standalone Python functions used to generate them.
What would you like to see as a solution?
Here is an example image and a few different methods of rotating it. Currently OneTrainer uses the first method (rotate without expanding), which keeps most of the content of the original image as well as the same size, but does create the black corners. Rotating with expanding keeps the entire original image but results in a much larger black area. Rotate and crop leaves no empty space, but cuts out much more of the original image. The latter two methods also require resizing the image in one direction or the other. These examples are also on a nearly square image; images with tall/wide aspect ratios may need to be cropped differently to minimize the affected area. For the rest of the images I stuck with the first rotation method.
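For reference, here's a rough sketch of the three rotation variants using Pillow. The function names are mine and the crop uses the standard largest-inscribed-rectangle formula, so it may not exactly match the attached script:

```python
from PIL import Image
import math

def rotate_no_expand(img, angle):
    # Current behavior: same size, corners filled with black.
    return img.rotate(angle, resample=Image.BICUBIC, expand=False)

def rotate_expand(img, angle):
    # Keeps the whole original image, but the black area gets much larger.
    return img.rotate(angle, resample=Image.BICUBIC, expand=True)

def largest_inner_size(w, h, angle):
    # Largest axis-aligned rectangle that fits inside a w x h rectangle
    # rotated by `angle` degrees (standard "rotated rect with max area" result).
    a = math.radians(angle)
    sin_a, cos_a = abs(math.sin(a)), abs(math.cos(a))
    long_side, short_side = max(w, h), min(w, h)
    if short_side <= 2 * sin_a * cos_a * long_side or abs(sin_a - cos_a) < 1e-10:
        # Half-constrained case (large angles or extreme aspect ratios).
        x = 0.5 * short_side
        return (x / sin_a, x / cos_a) if w >= h else (x / cos_a, x / sin_a)
    cos_2a = cos_a * cos_a - sin_a * sin_a
    return (w * cos_a - h * sin_a) / cos_2a, (h * cos_a - w * sin_a) / cos_2a

def rotate_and_crop(img, angle):
    # No empty space, but more of the original image is cut away.
    w, h = img.size
    crop_w, crop_h = largest_inner_size(w, h, angle)
    rotated = img.rotate(angle, resample=Image.BICUBIC, expand=False)
    cx, cy = w / 2, h / 2
    return rotated.crop((cx - crop_w / 2, cy - crop_h / 2,
                         cx + crop_w / 2, cy + crop_h / 2))
```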
Here's a mask generated for the rotated image, with a few pixels of additional "padding" around the transition between image and background. This could probably be combined with any existing image masks automatically, reducing how much is learned from those outside areas.
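A minimal sketch of how such a mask could be generated with Pillow (the padding width and function name are placeholders): rotate an all-white mask so the corners become black, then shrink the white area by a few pixels so the transition is excluded too. It could then be combined with an existing training mask via ImageChops.multiply.

```python
from PIL import Image, ImageFilter

def rotation_mask(size, angle, padding=5):
    w, h = size
    mask = Image.new("L", (w, h), 255)      # white = area to learn from
    mask = mask.rotate(angle, fillcolor=0)  # corners become black
    # Erode the white area by `padding` pixels to also cover the transition.
    return mask.filter(ImageFilter.MinFilter(2 * padding + 1))
```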
Some examples of using solid colors to fill the empty space. The first picture uses the "average" color of the entire image as the fill. The second uses the "dominant" color of the image, which works a bit better in examples like this where color variation is significant, or in images with large, mostly uniform backgrounds. The third takes the average color of the pixels along each edge and fills the adjacent corner with that color. This works well for edges where the color is mostly uniform (e.g. if one edge is something like sky, water, or grass, or if the image has a solid-color border), but not as well for edges with big transitions. Whether an edge is uniform enough could probably be quantified automatically by measuring the variance along each edge, or by using the same "dominant" color calculation instead of the average, applied on an edge-by-edge basis.
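Roughly, the first two fills could look like the sketch below with Pillow. "Dominant" here is approximated by quantizing to a small palette and taking the most frequent entry, which may not match the attached script, and the per-edge fill would need separate compositing per corner rather than a single fillcolor:

```python
from PIL import Image, ImageStat

def average_color(img):
    # Mean of each RGB channel over the whole image.
    return tuple(int(c) for c in ImageStat.Stat(img.convert("RGB")).mean)

def dominant_color(img, colors=8):
    # Quantize to a small palette and take the most common palette entry.
    small = img.convert("RGB").quantize(colors=colors)
    count, index = max(small.getcolors())
    palette = small.getpalette()
    return tuple(palette[index * 3 : index * 3 + 3])

def rotate_with_fill(img, angle, fill):
    return img.convert("RGB").rotate(angle, resample=Image.BICUBIC, fillcolor=fill)

# Example: rotate_with_fill(img, 7, dominant_color(img))
```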
Here are some examples of the rotated image pasted onto itself. The first is simply the rotated image on top of the unrotated image. Most colors and textures line up okay, though the discontinuity is obvious in some spots. The second uses, as the background, the image rotated and pasted onto itself in small angle increments; this doesn't have as much of an issue with discontinuities, but the "twisting" creates some odd lines. The third image is pasted onto a rotated+cropped copy of the original; this lines up better than the first example and straight lines are not bent at the transition, though some sections are still not clean. It could probably be improved somewhat by skewing/twisting the background image. The last image shows the third image with a Gaussian blur applied to the background section and the edge of the rotated image, though the same blur could be applied to any of these examples. Visually it helps reduce the obvious transition at the image edge, though I have no idea whether it would benefit how an image model sees it. It may eventually teach the model that images have slightly blurry corners, but that's better than teaching it that images have large black triangles in the corners.
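Here's a rough sketch of the first and last variants (the rotated image composited over an optionally blurred copy of the unrotated original); swapping the background for the rotate-and-crop result above would give the third variant. Function and parameter names are placeholders, not the attached script:

```python
from PIL import Image, ImageFilter

def rotate_over_background(img, angle, blur_radius=0):
    img = img.convert("RGB")
    background = img.copy()
    if blur_radius > 0:
        background = background.filter(ImageFilter.GaussianBlur(blur_radius))
    rotated = img.rotate(angle, resample=Image.BICUBIC)
    # White where the rotated content is, black in the corners.
    mask = Image.new("L", img.size, 255).rotate(angle)
    if blur_radius > 0:
        # Blurring the mask feathers the transition at the rotated image edge.
        mask = mask.filter(ImageFilter.GaussianBlur(blur_radius))
    return Image.composite(rotated, background, mask)
```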
rotation_test.txt edit: updated
Here's the Python code I wrote for these rotations, mostly using PIL for image manipulation. It uses an input image from OneTrainer/test_images and outputs to OneTrainer/test_images/output_img, but those paths can be changed. None of this is really ready to be implemented in mgds or OneTrainer, and I had some issues converting images from "RGB" to "RGBA", so only PNG files work right now.
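If the PNG-only limitation is just the mode handling, something like the following might be enough (paths follow the folders mentioned above, the filename is a placeholder, and this is an untested assumption): convert whatever Pillow opens to RGBA up front, and back to RGB before saving to formats without an alpha channel.

```python
from PIL import Image

img = Image.open("OneTrainer/test_images/example.jpg").convert("RGBA")
# ... apply the rotation/fill functions ...
img.convert("RGB").save("OneTrainer/test_images/output_img/example_out.jpg")
```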
Have you considered alternatives? List them here.
There are a few other methods that could be used to fill the background that I didn't explore or couldn't figure out how to easily do in Python:
- Simply using a random color as a background fill may provide enough variation to avoid learning the corner feature
- Could also fill with some form of random noise
- Doing a "radial blur" on the edges of the image, extending the edge pixel colors straight out from the center
- Filling the edges with the original image mirrored along that axis, so that features at the edges are continuous (though may change direction)
- Some sort of "edge detection" to find lines that intersect with the edge of the original image, then continue them to the new edge and fill with nearest color
- Crop image based on preserving specific identified features (face, eyes, etc) or areas of "visual interest" (this is an example of that)
- Expanding the original image using seam carving (aka "content-aware rescale") before rotating
- This may also be useful for changing image aspect ratios into buckets with less distortion/cropping, and supports masks for preserving/removing specific objects
- It's possible it could cause odd pixel-level distortions that aren't easily visible but affect training negatively, though I haven't found any discussion about that. Edit: some examples using seam carving
- Outpaint the outer areas of the image using the selected training model
- This would likely be the most computationally intensive method, and it would probably be more efficient to create one "expanded" image per training image and rotate+crop it for all later variations.
- Starting from one of the images pasted onto itself with some additional blur+noise could save steps and give more reliable results compared to outpainting from scratch
Any of these or the other methods may be able to be combined (ex. rotate 5 degrees with method A then 5 degrees with method B), or picked on a per-image basis based on image features, rotation angle, or just randomly.
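As a quick illustration of the first two alternatives in the list (and of picking a method randomly per image), assuming Pillow and numpy; all names here are placeholders:

```python
import random
import numpy as np
from PIL import Image

def rotate_random_color(img, angle):
    # Fill the corners with a random solid color.
    fill = tuple(random.randint(0, 255) for _ in range(3))
    return img.convert("RGB").rotate(angle, resample=Image.BICUBIC, fillcolor=fill)

def rotate_noise_fill(img, angle):
    # Fill the corners with uniform random noise.
    img = img.convert("RGB")
    noise = Image.fromarray(
        np.random.randint(0, 256, (img.height, img.width, 3), dtype=np.uint8))
    rotated = img.rotate(angle, resample=Image.BICUBIC)
    mask = Image.new("L", img.size, 255).rotate(angle)  # black in the corners
    return Image.composite(rotated, noise, mask)

def rotate_random_method(img, angle):
    # Pick one of the fill methods at random per image.
    return random.choice([rotate_random_color, rotate_noise_fill])(img, angle)
```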
Did some testing with a few of these methods to see directly how they would impact learning. I had a dataset of ~75 images, and for each one created 4 copies with random rotations between +-10 degrees, using the "default" (black corners) method, the "dominant fill" method, and the "unrotated bg" method with a Gaussian blur around the edges. The dataset also had masks around the character that I had generated previously, so I copied/rotated those for each image as well and tested the behavior with and without masked training enabled.
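The rotated copies and masks were generated with roughly this kind of loop; the folder layout and the `-masklabel.png` naming here are assumptions based on OneTrainer's masked-training convention, not the exact script used:

```python
import random
from pathlib import Path
from PIL import Image

src_dir = Path("dataset")            # assumed: image.png + image-masklabel.png pairs
out_dir = Path("dataset_rotated")
out_dir.mkdir(exist_ok=True)

for img_path in src_dir.glob("*.png"):
    if img_path.stem.endswith("-masklabel"):
        continue
    img = Image.open(img_path).convert("RGB")
    mask = Image.open(img_path.with_name(img_path.stem + "-masklabel.png")).convert("L")
    for i in range(4):
        angle = random.uniform(-10, 10)
        # Default (black corners) method shown here; swap in one of the fill
        # functions above for the "dominant fill" or "unrotated bg" variants.
        img.rotate(angle, resample=Image.BICUBIC).save(out_dir / f"{img_path.stem}_rot{i}.png")
        mask.rotate(angle).save(out_dir / f"{img_path.stem}_rot{i}-masklabel.png")
```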
Insurance lizard guy, trained on BB95 (a 1.5-based model) and sampled on SwarmUI
You can see that it learned the character fairly well, and still has enough flexibility to give him different clothing and backgrounds than were present in the training data. There's some hand weirdness, but that's probably a limitation of 1.5. The effect of the rotation is also pretty obvious: the LoRAs trained on rotated variations do seem to produce more tilted outputs. 10 degrees may have been a bit too much for training something like this - maybe it would be better if images rotated above a certain threshold were automatically tagged with something like "dutch angle, tilted, rotated left/right"?
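The threshold-based tagging could be as simple as something like this sketch; the threshold value is arbitrary, the tag wording is just the suggestion above, and the left/right mapping depends on the rotation convention used:

```python
def rotation_tags(angle, threshold=5.0):
    # Extra caption tags for noticeably rotated copies, e.g. to append to the
    # image's caption .txt file.
    if abs(angle) < threshold:
        return ""
    direction = "rotated left" if angle > 0 else "rotated right"
    return f", dutch angle, tilted, {direction}"
```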
The effect of the corners is also pretty clear. It doesn't always show up, but it does often enough to be a significant problem. The default behavior definitely results in black borders. The "dominant" fill method is slightly better but still creates big areas of flat color. The "background" method didn't create any obvious problems in any of the tests I did, so that seems to be the best overall. It might be creating a slight bias towards blurred backgrounds, but it doesn't seem to be overwhelming.
Masking the outside area of the image didn't seem to significantly help prevent the corner effects, at least not with the mask settings I used (0.2 weight, 0.1 unmasked probability, no normalization). More aggressive masking may help but could cause additional issues.
A few more comparisons of different methods and masks - "edge masks" are just a rectangular border around the image, rotated to match the training image (see the second picture in the first post for an example). The mask should cover the corner regions plus around 5 pixels of overlap, but not cover any of the background region. As before, the LoRAs trained with masks weren't significantly better than the ones trained without. One of the "edge mask" images did show what looks like a ghost of the mask edge; I'm not sure how that came about, but it may indicate that feathering the mask edges would be somewhat beneficial. Otherwise the bg fill seems to be the best overall and probably the easiest.
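If feathering does help, combining the rotated edge mask (see the rotation_mask sketch above) with the existing character mask and softening the boundary could look something like this, assuming masks are "L" images of the same size with white as the trained area:

```python
from PIL import ImageChops, ImageFilter

def combine_and_feather(character_mask, edge_mask, feather_radius=3.0):
    # Keep only the area covered by both masks.
    combined = ImageChops.multiply(character_mask.convert("L"), edge_mask.convert("L"))
    # A soft boundary should avoid the "ghost" of the mask edge showing up in outputs.
    return combined.filter(ImageFilter.GaussianBlur(feather_radius))
```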