Add basic implementation of AuraFlowImg2ImgPipeline
What does this PR do?
Adds a fairly mechanical conversion of the other img2img pipelines (mostly SD3/Flux) to support AuraFlow. It seems to require somewhat higher strength (0.75+) compared to SDXL (my only point of reference that I've used a lot for I2I), but it works fine and does not complain about GGUF (I still need to check compilation).
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline?
- [X] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [X] Did you write any new necessary tests?
Who can review?
@cloneofsimo @sayakpaul @yiyixuxu @asomoza
Thanks for yet another contribution! Could you post a snippet and some results?
Apologies, had to clean up things and make the tests actually work.
The docstrings included in the PR should be a good snippet, e.g.:
```python
import torch
from diffusers import AuraFlowImg2ImgPipeline
import requests
from PIL import Image
from io import BytesIO

# download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

pipe = AuraFlowImg2ImgPipeline.from_pretrained("fal/AuraFlow-v0.3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A fantasy landscape, trending on artstation"
image = pipe(prompt=prompt, image=init_image, strength=0.75, num_inference_steps=50).images[0]
image.save("aura_flow_img2img.png")
```
Unfortunately, it seems that my math may be wrong somewhere:
With strength 0.75
With strength 0.85
With strength 0.95
But with 0.9
@DN6, any ideas?
Unfortunately, I am still seeing visual noise instead of the image at some values of strength, so something in my math must be wrong.
Do those values generally tend to be higher?
A bit hard to see, sorry.
Weird, I can click on the image to get the full-sized one (with an extra click). It's the 0.88–0.94 range that is visual noise, and I think images above 0.94 are not even using the initial image.
I still have no idea what is going on, but I think the code is correct; AuraFlow may just require some special handling. Here are my observations:
- When using certain input images, the VAE encoder, responsible for creating the initial latent representation (`x0`) of the input image, produces a latent distribution (`latent_dist`) where the standard deviation (`std`) component consistently collapses to effectively zero (e.g., `std=0.0000`, corresponding to a highly negative `logvar`).
- This variance collapse is observed even when ensuring the VAE was loaded and operated entirely in `float32` precision. This confirms the issue is not merely an `fp16` underflow problem during VAE computation, but rather suggests the SDXL VAE predicts near-zero variance for this type of input (I was aware of different issues with the SDXL VAE, but not this specific behavior).
- Because the predicted `std` is zero, the subsequent step of sampling the initial latent variable `x0` from this distribution (`mean + std * noise`) becomes deterministic, effectively yielding only the `mean` component (`x0 = mean`).
- To address the lack of variance in `x0`, I attempted an experiment in which the `logvar` output by the VAE was manually "clamped" to a minimum value (tested `min_logvar = -10.0`, resulting in `std ≈ 0.0067`, and `min_logvar = -4.0`, resulting in `std ≈ 0.135`) before sampling `x0`. This successfully introduced non-zero variance into the initial latent state. At this point my assumption was simply that the issue is in the VAE.
- Despite successfully injecting variance into `x0` via clamping, the pipeline still produced noise/corrupted images when run at high `strength` values (e.g., `strength=0.9`).
- Crucially, the pipeline works reasonably well at lower `strength` values (e.g., `strength=0.7`), producing recognizable image outputs that incorporate the initial image structure.
The core issue no longer seems to be only the deterministic `x0` caused by the initial VAE variance collapse (fixing that didn't solve the high-strength problem). Instead, the failure at high strength (0.9) may stem from an instability in the denoising process itself when it is initiated from the very high noise levels corresponding to these strengths. The process is stable when starting from the lower noise levels associated with moderate strength (0.7).
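The clamping experiment from the observations above can be sketched in plain Python (the function name `sample_latent` and the scalar simplification are mine; the real pipeline does this on tensors via the VAE's latent distribution):

```python
import math
import random

def sample_latent(mean, logvar, min_logvar=-10.0, rng=random):
    """Sample x0 = mean + std * noise, with logvar floored so the
    VAE's collapsed variance cannot make sampling fully deterministic."""
    clamped = max(logvar, min_logvar)  # floor the predicted log-variance
    std = math.exp(0.5 * clamped)      # std = exp(logvar / 2)
    return mean + std * rng.gauss(0.0, 1.0)

# With min_logvar = -10.0, std = exp(-5) ≈ 0.0067, and with
# min_logvar = -4.0, std = exp(-2) ≈ 0.135, matching the values above.
```

This restores non-zero variance in `x0`, but as noted, it does not fix the high-strength failure, which points away from the VAE as the root cause.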
Thanks for the detailed analysis. Do these results vary between AuraFlow and AuraFlow v0.3?
Yes, I am seeing the same behavior in 0.2.
I focused too much on the VAE in my last comment. I don't think it's the root cause (after all, SDXL works just fine); perhaps the real issue is some kind of numerical instability we are facing?
I've included 3 videos generated by taking a snapshot of the model state every 5 frames: before, at the moment of the issue, and after. At least looking at them, I can't notice anything weird that would explain the problem.
https://github.com/user-attachments/assets/471f3572-5552-4f4e-8632-f459675ce44d
https://github.com/user-attachments/assets/e4acd117-7b89-44e8-819f-e1370026e13a
https://github.com/user-attachments/assets/2ce0fe29-6c47-445d-9baa-dd5c9bbc22d3
I've also attempted to affect the problematic strength range by changing the number of steps or the guidance scale, but it had no effect. Interestingly, enabling `use_karras_sigmas=True` on the scheduler seems to "fix" the issue, as I can no longer hit the noise output, but I still experience a very sharp change from "low-strength i2i" to "just t2i" at around 0.98 strength. It feels like I am missing something super obvious here.
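For context, the usual SD3/Flux-style mapping from strength to a starting point in the sigma schedule looks roughly like the sketch below (the function name and the linear stand-in schedule are mine, for illustration; the real schedule comes from the scheduler). It shows why high strength starts denoising from a sigma so close to 1.0 that `x0` contributes almost nothing to the initial latent `(1 - sigma) * x0 + sigma * noise`:

```python
def img2img_schedule(sigmas, strength):
    """Truncate a flow-matching sigma schedule the way SD3/Flux img2img does:
    strength decides how far into the schedule denoising starts, and the first
    retained sigma sets the initial noise mix."""
    num_steps = len(sigmas)
    init_timestep = min(int(num_steps * strength), num_steps)
    t_start = max(num_steps - init_timestep, 0)
    return sigmas[t_start:]

# A linear 1.0 → ~0.0 schedule as a stand-in for the scheduler's real one.
sigmas = [1.0 - i / 50 for i in range(50)]
print(img2img_schedule(sigmas, 0.75)[0])  # starts around sigma ≈ 0.74
print(img2img_schedule(sigmas, 0.95)[0])  # sigma ≈ 0.94: x0 is nearly drowned in noise
```

Under this mapping, the 0.88–0.94 failure range corresponds to starting sigmas where the initial image is almost, but not entirely, replaced by noise, which is consistent with the instability being in the denoising trajectory rather than in the latent preparation.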
Thank you @bghira, your code makes much more sense! Things are definitely looking better now and lower strengths produce reasonable results, but the noise issue remains, and after running a few experiments I continue to be confused. The banding pattern is interesting; perhaps wrong dimensions somewhere?
Can you try with square samples?
Another thing I've discovered lately, but wasn't sure whether I should share (withholding it means my own platform's image results are better :sweat_smile:): if we decouple the strength from the number of steps / starting step for denoising, we can modulate the img2img by the sigmas and get a smoother trajectory on the denoising updates.
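One possible reading of that suggestion, sketched below (this is my interpretation, not the commenter's actual implementation, and the linear base schedule is a stand-in): instead of dropping the first `strength * num_steps` steps of the schedule, keep every step and rescale the whole sigma trajectory so it starts at `sigma = strength`. Every strength then gets the same number of (smaller) denoising updates.

```python
def rescaled_sigmas(num_steps, strength):
    """Decouple strength from step count: keep all steps but scale the
    sigma trajectory so denoising starts at sigma = strength instead of
    truncating the schedule like standard img2img does."""
    base = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 → 0.0
    return [s * strength for s in base]

sigmas = rescaled_sigmas(20, 0.9)  # 21 sigmas from 0.9 down to 0.0
```

Compared to schedule truncation, this avoids the abrupt jump in starting noise level as strength crosses a step boundary, which may be relevant to the sharp low-strength-i2i to t2i transition observed above.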
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.