
Add basic implementation of AuraFlowImg2ImgPipeline

Open AstraliteHeart opened this issue 9 months ago • 15 comments

What does this PR do?

Adds a fairly mechanical conversion of the other img2img pipelines (mostly SD3/Flux) to support AuraFlow. It seems to require a somewhat higher strength (0.75+) than SDXL (my only point of reference that I've used a lot for i2i), but it works fine and does not complain about GGUF (I still need to check compilation).

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X] Did you read the contributor guideline?
  • [X] Did you read our philosophy doc (important for complex PRs)?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [X] Did you write any new necessary tests?

Who can review?

@cloneofsimo @sayakpaul @yiyixuxu @asomoza

AstraliteHeart avatar Apr 16 '25 11:04 AstraliteHeart

Thanks for yet another contribution! Could you post a snippet and some results?

sayakpaul avatar Apr 16 '25 12:04 sayakpaul

Apologies, I had to clean things up and make the tests actually pass.

The docstrings included in the CL should serve as a good snippet, i.e.

```python
import torch
from diffusers import AuraFlowImg2ImgPipeline
import requests
from PIL import Image
from io import BytesIO

# download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

pipe = AuraFlowImg2ImgPipeline.from_pretrained("fal/AuraFlow-v0.3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "A fantasy landscape, trending on artstation"
image = pipe(prompt=prompt, image=init_image, strength=0.75, num_inference_steps=50).images[0]
image.save("aura_flow_img2img.png")
```

Unfortunately, it seems my math may be wrong somewhere.

With strength 0.75

image

with strength 0.85

image

with strength 0.95

image

but with 0.9

image

@DN6, any ideas?

AstraliteHeart avatar Apr 17 '25 22:04 AstraliteHeart

Unfortunately I'm still seeing visual noise instead of the image at some strength values, so something in my math must be wrong.

AstraliteHeart avatar Apr 18 '25 10:04 AstraliteHeart

Do those values generally tend to be higher?

sayakpaul avatar Apr 18 '25 10:04 sayakpaul

image

AstraliteHeart avatar Apr 18 '25 12:04 AstraliteHeart

A bit hard to see, sorry.

sayakpaul avatar Apr 18 '25 12:04 sayakpaul

Weird, I can click on the image to get the full-sized one (with an extra click). It's the 0.88-0.94 range that is visual noise, and I think images above 0.94 are not even using the initial image.

AstraliteHeart avatar Apr 18 '25 12:04 AstraliteHeart

I still have no idea what is going on, but I think the code is correct and AF may require some special handling. Here are my observations:

  1. When using certain input images, the VAE encoder, responsible for creating the initial latent representation (x0) of the input image, produces a latent distribution (latent_dist) where the standard deviation (std) component consistently collapses to effectively zero (e.g., std=0.0000, corresponding to a highly negative logvar).

  2. This variance collapse is observed even when ensuring the VAE was loaded and operated entirely in float32 precision. This confirms the issue is not merely an fp16 underflow problem during VAE computation but rather suggests the SDXL VAE predicts near-zero variance for this type of input (I was aware of different issues with SDXL VAE but not this specific behavior).

  3. Because the predicted std is zero, the subsequent step of sampling the initial latent variable x0 from this distribution (mean + std * noise) becomes deterministic, effectively yielding only the mean component (x0 = mean).

  4. To address the lack of variance in x0, I ran an experiment in which the logvar output by the VAE was manually "clamped" to a minimum value (tested min_logvar = -10.0, resulting in std ≈ 0.0067, and min_logvar = -4.0, resulting in std ≈ 0.135) before sampling x0. This successfully introduced non-zero variance into the initial latent state. At this point my assumption was simply that the issue is in the VAE.

  5. Despite successfully injecting variance into x0 via clamping, the pipeline still produced noise/corrupted images when run at high strength values (e.g., strength=0.9).

  6. Crucially, the pipeline works reasonably well at lower strength values (e.g., strength=0.7), producing recognizable image outputs that incorporate the initial image structure.

The core issue no longer seems to be only the deterministic x0 caused by the initial VAE variance collapse (as fixing that didn't solve the high-strength problem). Instead, the failure at high strength (0.9) may stem from an instability in the denoising process itself when initiated from the very high noise levels corresponding to these high strengths. The process is stable when starting from the lower noise levels associated with moderate strength (0.7).
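For clarity, the clamping experiment from point 4 can be sketched roughly like this (a hypothetical scalar illustration of the idea, not the actual pipeline code; the real latent_dist is a diagonal Gaussian over tensors):

```python
import math
import random

def sample_latent(mean, logvar, min_logvar=-10.0, rng=random):
    """Sample x0 = mean + std * noise, flooring logvar so std cannot collapse to zero."""
    logvar = max(logvar, min_logvar)       # the "clamp" from point 4 above
    std = math.exp(0.5 * logvar)           # min_logvar=-10.0 -> std ~= 0.0067
    return mean + std * rng.gauss(0.0, 1.0)

# Even if the VAE predicts a collapsed (very negative) logvar, the floor
# keeps a small but non-zero variance in the sampled x0:
x0 = sample_latent(mean=1.0, logvar=-100.0, min_logvar=-10.0)
```

With min_logvar = -4.0 the floor gives std = exp(-2) ≈ 0.135, matching the numbers quoted above; without the floor, std underflows to effectively zero and x0 degenerates to the mean.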

AstraliteHeart avatar Apr 19 '25 06:04 AstraliteHeart

Thanks for the detailed analysis. Does this vary between AuraFlow and AuraFlow v0.3?

sayakpaul avatar Apr 19 '25 12:04 sayakpaul

Yes, I am seeing the same behavior in 0.2.

I focused too much on the VAE in the last comment - I don't think it's the root cause (after all, SDXL works just fine), and perhaps the real issue is some kind of numerical instability we are facing?

I've included 3 videos, generated by taking a snapshot of the model state every 5 frames - before, at the moment of the issue, and after. At least looking at them, I can't notice anything weird that could explain the problem.

https://github.com/user-attachments/assets/471f3572-5552-4f4e-8632-f459675ce44d

https://github.com/user-attachments/assets/e4acd117-7b89-44e8-819f-e1370026e13a

https://github.com/user-attachments/assets/2ce0fe29-6c47-445d-9baa-dd5c9bbc22d3

I've also attempted to affect the problematic strength range by changing the number of steps or the guidance scale, but it had no effect. Interestingly, enabling use_karras_sigmas=True on the scheduler seems to "fix" the issue, as I can no longer hit the noise output, but I still see a very sharp transition from "low-strength i2i" to "just t2i" at around 0.98 strength. Feels like I'm missing something super obvious here.
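For context on why high strength jumps so abruptly toward pure noise: in the standard diffusers img2img convention (which, as far as I can tell, this port follows), strength simply truncates the timestep schedule, so values near 1.0 start denoising from almost the maximum noise level. A minimal sketch of that mapping:

```python
def get_start_step(num_inference_steps: int, strength: float) -> int:
    # Standard diffusers img2img indexing: strength picks how many of the
    # scheduler's steps actually run; the skipped prefix also determines
    # how much noise is added to the encoded init image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    return max(num_inference_steps - init_timestep, 0)

# strength=0.75 with 50 steps runs the last 37 steps (start index 13);
# strength=1.0 starts from step 0, i.e. effectively text-to-image.
```

Note that this mapping is over discrete step indices, not noise levels, so the actual starting sigma depends entirely on the scheduler's sigma schedule - which would explain why use_karras_sigmas changes the behavior without touching the pipeline code.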

AstraliteHeart avatar Apr 19 '25 20:04 AstraliteHeart

Thank you @bghira, your code makes much more sense! Things are definitely looking better now and lower strengths produce reasonable results, but the noise issue remains, and after running a few experiments I continue to be confused. The banding pattern is interesting; perhaps wrong dimensions somewhere?

image

AstraliteHeart avatar May 02 '25 21:05 AstraliteHeart

can you try with square samples?

bghira avatar May 02 '25 21:05 bghira

another thing i've discovered lately but wasn't sure whether i should share, as withholding it means my own platform's image results are better :sweat_smile: but if we decouple the strength from the number of steps / starting step for denoising, we can modulate the img2img by the sigmas and get a smoother trajectory on the denoising updates.
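One way that decoupling could look (my own sketch of the idea, not bghira's actual implementation): pick the starting noise level by interpolating the sigma schedule continuously with strength, instead of truncating the discrete step list by an integer count:

```python
def start_sigma(sigmas, strength):
    # sigmas: descending noise schedule, sigmas[0] = max noise, sigmas[-1] ~ 0.
    # strength=1.0 maps to full noise and 0.0 to no noise, continuously,
    # independently of how many discrete steps the scheduler exposes.
    pos = (len(sigmas) - 1) * (1.0 - strength)
    lo = int(pos)
    hi = min(lo + 1, len(sigmas) - 1)
    frac = pos - lo
    return sigmas[lo] * (1.0 - frac) + sigmas[hi] * frac
```

Because the result varies smoothly with strength, small changes in strength can no longer jump across a large gap in the sigma schedule, which is one plausible reading of the "sharp transition" reported above.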

bghira avatar May 02 '25 21:05 bghira

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 27 '25 15:05 github-actions[bot]

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.