stable-diffusion.cpp

feat: add easycache support

Open rmatif opened this issue 2 months ago • 9 comments

This PR adds support for EasyCache, a variant of TeaCache that achieves a significant speedup.

Currently tested only with CUDA, on Flux and Qwen.

Command usage:

--easycache threshold,start_percent,end_percent

Examples:

| Without EasyCache | --easycache 0.2,0.15,0.95 | Speedup |
|---|---|---|
| output2 | output3 | x1.85 |
| noeasycache | easycache | x1.85 |

rmatif avatar Nov 04 '25 23:11 rmatif

[!NOTE] Outdated

Ran it with a SPARK (preview) Chroma finetune

| thresh | img | real speedup |
|---|---|---|
| 0 | ec_chroma spark_noec | baseline |
| 0.025 | ec_chroma spark_0 025 | 1.12x |
| 0.1 | ec_chroma spark_0 1 | 1.31x |
| 0.2 | ec_chroma spark_0 2 | 1.4x |

I noticed the estimated speedup is off. The step count it is based on is off by exactly 2x, so I guess CFG is not handled properly yet.

E.g. 40 steps with CFG are actually 80 model calls:

40/(40-8) -> 1.25 (estimated 1.25)
80/(80-8) -> 1.11 (measured 1.12)
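The arithmetic above can be written out as a small sketch. It assumes the estimate is simply total calls divided by non-skipped calls, i.e. that a skipped call costs nothing; `estimated_speedup` is a hypothetical helper for illustration, not a function from the PR.

```python
def estimated_speedup(total_calls: int, skipped_calls: int) -> float:
    """Ideal speedup if every skipped model call were free (an assumption)."""
    if not 0 <= skipped_calls < total_calls:
        raise ValueError("skipped_calls must be in [0, total_calls)")
    return total_calls / (total_calls - skipped_calls)


steps, skipped = 40, 8

# Counting one model call per step, as the logged estimate appears to do.
print(f"{estimated_speedup(steps, skipped):.2f}")      # 1.25

# Counting two model calls per step under CFG (cond + uncond), same 8 skips.
print(f"{estimated_speedup(2 * steps, skipped):.2f}")  # 1.11
```

The second figure is the one that lines up with the measured 1.12x.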

I included 0.025 because that is what the original implementation used for Wan.

Anyway, good stuff. I'll take the 11% speedup.

$ result/bin/sd --diffusion-model models/SPARK.Chroma_preview-q5_k.gguf --t5xxl models/flux-extra/t5xxl_fp16.safetensors -t 8 --vae models/flux-extra/ae-f16.gguf --sampling-method dpm++2m --scheduler simple --steps 40 --cfg-scale 3.8 -n "low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors, noisy, artifacts, fake, generated, overblown, over exposed" -p "Photograph of a lovely cat. Green rolling hills in the background." --clip-on-cpu --offload-to-cpu -v -W 768 -H 1024 --diffusion-fa --chroma-disable-dit-mask --easycache 0.025,0.15,0.95

Green-Sky avatar Nov 06 '25 10:11 Green-Sky

Is this additional noise in the image expected? I haven’t noticed it in other Tea Cache tests or similar methods.

JohnLoveJoy avatar Nov 06 '25 14:11 JohnLoveJoy

> Is this additional noise in the image expected? I haven’t noticed it in other Tea Cache tests or similar methods.

I speculate that this happens with some models (see the second picture in the OP) when too much is skipped at the end of the sampling phase. You can lower end_percent (the last value) from 0.95 to 0.80 or lower. This results in fewer late skips, but also fewer skips overall, so it is not conclusive(!).
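To illustrate why a lower end value means fewer late skips but also fewer skips overall, here is a minimal sketch of the window of steps in which skipping could happen. It assumes progress is measured as step_index / total_steps and that the end bound is exclusive. Both are assumptions made for illustration, and `eligible_steps` is a hypothetical helper, not code from this PR.

```python
def eligible_steps(total_steps: int, start_percent: float, end_percent: float) -> list[int]:
    """Step indices whose progress falls inside [start_percent, end_percent).

    Assumption: progress of step i is i / total_steps. The real
    implementation may map percentages to steps differently.
    """
    return [
        i
        for i in range(total_steps)
        if start_percent <= i / total_steps < end_percent
    ]


for end in (0.95, 0.80):
    window = eligible_steps(20, 0.15, end)
    print(f"end_percent={end}: {len(window)} candidate steps, last is {window[-1]}")
```

Under these assumptions, shrinking the window only limits where skips may occur; how many steps are actually skipped still depends on the threshold.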

I observed similar behavior with the beta/smoothstep schedulers, which reduce the noise in those models. These schedulers spend more time in the early and late timesteps of sampling (S-curve or gain-function behavior).

edit: Of course, this PR could be broken too. Another factor is the quant used, which seems to exacerbate the noise.

edit2: Here is what smoothstep (almost beta) with 20 steps looks like: output

Green-Sky avatar Nov 06 '25 14:11 Green-Sky

I think it would be simpler if --easycache threshold,start_percent,end_percent were split into 3 or 4 separate options. You generally just want to enable the safe default OR only modify the threshold. Or maybe make start and end optional? Not sure.

Green-Sky avatar Nov 07 '25 17:11 Green-Sky

> I think it would be simpler if --easycache threshold,start_percent,end_percent is split into 3 or 4 commands. You generally just want to enable the safe default OR only modify threshold. Or maybe make start and end optional? Not sure.

Thanks for the suggestion! I think we can just make everything optional, but the quality seems to vary depending on the model/quant type, so this feature is definitely for "advanced" users I'd say. I’ll come back to this PR this weekend
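As a rough sketch of what "everything optional" could look like, the snippet below parses the comma-separated value and falls back to defaults for any field that is missing or empty. The names `EasyCacheConfig` and `parse_easycache` are hypothetical, and the default values are placeholders chosen for illustration; none of this is taken from the PR's code.

```python
from dataclasses import dataclass


@dataclass
class EasyCacheConfig:
    # Placeholder defaults, assumed for illustration only.
    threshold: float = 0.2
    start_percent: float = 0.15
    end_percent: float = 0.95


def parse_easycache(arg: str) -> EasyCacheConfig:
    """Parse 'threshold,start_percent,end_percent'; every field is optional."""
    names = ("threshold", "start_percent", "end_percent")
    parts = arg.split(",")
    if len(parts) > len(names):
        raise ValueError(f"expected at most {len(names)} values, got {len(parts)}")

    cfg = EasyCacheConfig()
    for name, text in zip(names, parts):
        text = text.strip()
        if text:  # an empty field keeps its default
            setattr(cfg, name, float(text))

    if cfg.threshold < 0:
        raise ValueError("threshold must be non-negative")
    if not 0.0 <= cfg.start_percent < cfg.end_percent <= 1.0:
        raise ValueError("need 0 <= start_percent < end_percent <= 1")
    return cfg


print(parse_easycache("0.025"))   # only the threshold is overridden
print(parse_easycache(",,0.80"))  # only end_percent is overridden
```

With this shape, a bare threshold covers the common case, while the full triple stays available for advanced users.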

rmatif avatar Nov 07 '25 20:11 rmatif

It seems the results for the WAN video model are not very good. I'm not sure if something is wrong somewhere.

Without easycache

[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
  |==================================================| 20/20 - 2.22s/it
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 44.32s

https://github.com/user-attachments/assets/9c7ad71b-7264-4765-8a7f-5f17814b33c5

With --easycache 0.2,0.15,0.95

[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.200, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
  |==================================================| 20/20 - 1.64s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 10/20 steps (2.00x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 32.90s

https://github.com/user-attachments/assets/30c32867-04ab-4320-9f3e-5cbd21f59ec8

With --easycache 0.05,0.15,0.95

[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.050, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
  |==================================================| 20/20 - 1.82s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 7/20 steps (1.54x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 36.40s

https://github.com/user-attachments/assets/b4a75cc7-7680-48d7-b9cf-946e91f2bbd8
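For reference, the logged estimates can be compared with the wall-clock speedup implied by the sampling times above. This is a minimal sketch using only numbers from these logs. The last column doubles the call count per step, which would apply only if these runs used CFG; the logs do not say whether they did, so treat that column as an assumption.

```python
BASELINE_SECONDS = 44.32  # "sampling completed" time without EasyCache

# (easycache setting, sampling seconds, skipped steps, total steps), from the logs
RUNS = [
    ("0.2,0.15,0.95", 32.90, 10, 20),
    ("0.05,0.15,0.95", 36.40, 7, 20),
]


def speedups(seconds: float, skipped: int, steps: int) -> tuple[float, float, float]:
    """Return (measured, estimated, estimated_if_two_calls_per_step)."""
    measured = BASELINE_SECONDS / seconds
    estimated = steps / (steps - skipped)
    # Assumption: with CFG each step makes two model calls and only
    # `skipped` of them are avoided.
    estimated_cfg = (2 * steps) / (2 * steps - skipped)
    return measured, estimated, estimated_cfg


for setting, seconds, skipped, steps in RUNS:
    measured, estimated, estimated_cfg = speedups(seconds, skipped, steps)
    print(
        f"{setting}: measured {measured:.2f}x, "
        f"logged-style estimate {estimated:.2f}x, "
        f"two-calls-per-step estimate {estimated_cfg:.2f}x"
    )
```

In both runs the measured speedup is well below the logged estimate.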

leejet avatar Nov 16 '25 09:11 leejet

The values 0.025,0.15,0.95 worked pretty OK with Wan2.2 5B. Those are also the defaults in the original implementation. I will update and rerun that later.

edit: see https://github.com/leejet/stable-diffusion.cpp/issues/943

Green-Sky avatar Nov 16 '25 10:11 Green-Sky

Reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one, as before. CFG pairs should now stay in sync.

@Green-Sky

results for 0.2,0.15,0.95 now

output

@leejet

Results on wan2.1 1.3B

Without easycache

https://github.com/user-attachments/assets/06c3188e-b719-42f3-9dc7-1dc44e9f816b

With easycache 0.2,0.15,0.95

https://github.com/user-attachments/assets/ccd387fd-69ca-4acb-9881-80c256d4ef2d

I think this is ready to be merged

rmatif avatar Nov 17 '25 21:11 rmatif

> reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one before, CFG pairs should now stay in sync

Great! It now works and looks good in most cases for 0.2 AND the estimates do now match. :rocket:

| thresh | img | real speedup |
|---|---|---|
| 0 | output | 1.00x (355.82s) |
| 0.025 | output | 1.28x (277.47s) |
| 0.1 | output | 2.0x (176.18s) |
| 0.2 | output | 2.37x (150.00s) |
| 0.3 | output | 2.50x (142.11s) |

> I think this is ready to be merged

Yes, I think this is ready for review.

Did someone test heun yet? Not that you would combine heun and easycache, but still.

Green-Sky avatar Nov 18 '25 12:11 Green-Sky

LGTM. Thank you for your contribution!

Without easycache

https://github.com/user-attachments/assets/a8fd06c0-97bc-4dd3-b8ba-6d0c5f49a951

With easycache

https://github.com/user-attachments/assets/cd8b2ddd-2117-4567-9bf1-5fda24ea462c

leejet avatar Nov 19 '25 15:11 leejet