feat: add easycache support
This PR adds support for EasyCache, a variant of TeaCache that achieves significant speedup.
Currently tested only with CUDA and on Flux/Qwen.
Command usage:
`--easycache threshold,start_percent,end_percent`
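As a rough illustration of how the comma-separated triple could be parsed, here is a minimal sketch. This is not the actual stable-diffusion.cpp argument parser; the function name and validation rules are assumptions for illustration only.

```python
def parse_easycache(arg: str):
    """Hypothetical parser for 'threshold,start_percent,end_percent'.

    Not the real stable-diffusion.cpp code; a sketch of the expected shape.
    """
    parts = arg.split(",")
    if len(parts) != 3:
        raise ValueError("expected threshold,start_percent,end_percent")
    threshold, start_percent, end_percent = (float(p) for p in parts)
    # Percents are fractions of the sampling schedule, so they should be
    # ordered and lie in [0, 1] (assumed constraint).
    if not (0.0 <= start_percent <= end_percent <= 1.0):
        raise ValueError("percents must satisfy 0 <= start <= end <= 1")
    return threshold, start_percent, end_percent

print(parse_easycache("0.2,0.15,0.95"))  # (0.2, 0.15, 0.95)
```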
Examples:
| Without EasyCache | `--easycache 0.2,0.15,0.95` | Speedup |
|---|---|---|
| | | x1.85 |
| | | x1.85 |
> [!NOTE]
> Outdated
Ran it with a SPARK (preview) Chroma finetune
| thresh | img | real speedup |
|---|---|---|
| 0 | | baseline |
| 0.025 | | 1.12x |
| 0.1 | | 1.31x |
| 0.2 | | 1.4x |
I noticed the estimated speedup is off. It is off by exactly 2x, so I guess CFG is not handled properly yet.
E.g. 40 steps with CFG are actually 80 steps:
40/(40-8) -> 1.25 (estimated 1.25)
80/(80-8) -> 1.11 (measured 1.12)
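The arithmetic above can be written out as a small sketch: the estimate is total model calls divided by calls actually executed, and with CFG each sampling step issues two diffusion-model calls (cond + uncond). The function below is illustrative, not the actual implementation.

```python
def estimated_speedup(steps: int, skipped: int, cfg_enabled: bool) -> float:
    """Speedup estimate = total model calls / calls actually executed.

    With CFG each step runs two diffusion-model calls, so 40 steps
    really mean 80 calls; computing the estimate on steps alone while
    skipping counted calls makes it off by roughly 2x, as observed.
    """
    calls = steps * 2 if cfg_enabled else steps
    return calls / (calls - skipped)

# Numbers from the comment above: 8 skipped calls over a 40-step run.
print(round(estimated_speedup(40, 8, cfg_enabled=False), 2))  # 1.25 (estimated)
print(round(estimated_speedup(40, 8, cfg_enabled=True), 2))   # 1.11 (close to measured 1.12)
```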
I included 0.025 because that is what the original used for Wan.
Anyway, good stuff; I'll take the 11% speedup.
```shell
$ result/bin/sd --diffusion-model models/SPARK.Chroma_preview-q5_k.gguf --t5xxl models/flux-extra/t5xxl_fp16.safetensors -t 8 --vae models/flux-extra/ae-f16.gguf --sampling-method dpm++2m --scheduler simple --steps 40 --cfg-scale 3.8 -n "low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors, noisy, artifacts, fake, generated, overblown, over exposed" -p "Photograph of a lovely cat. Green rolling hills in the background." --clip-on-cpu --offload-to-cpu -v -W 768 -H 1024 --diffusion-fa --chroma-disable-dit-mask --easycache 0.025,0.15,0.95
```
Is this additional noise in the image expected? I haven't noticed it in other TeaCache tests or similar methods.
I speculate that this happens with some models (see the second pic in the OP) when too much is skipped at the end of the sampling phase. You can lower the end_percent value from 0.95 to 0.80 or lower. This will result in fewer skips late in sampling, but also fewer skips overall, so it is not conclusive(!).
I observed similar behavior with beta/smoothstep schedulers, which reduces the noise in those models. Those schedulers spend more time in the early and late timesteps of the sampling (s-curve or gain-function behavior).
edit: Of course this PR can be broken too. Another factor is the quant used, which seems to exacerbate the noise.
edit2: Here is what smoothstep (almost beta) with 20 steps looks like:
I think it would be simpler if `--easycache threshold,start_percent,end_percent` were split into 3 or 4 flags. You generally just want to enable the safe default OR only modify the threshold.
Or maybe make start and end optional? Not sure.
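To sketch what "make start and end optional" could look like: accept 1 to 3 comma-separated values and fall back to defaults for the rest. The defaults below (0.025, 0.15, 0.95) are the values this thread attributes to the original implementation; treat both them and the function shape as assumptions.

```python
# Assumed defaults, taken from the discussion in this thread.
DEFAULTS = (0.025, 0.15, 0.95)

def parse_easycache_optional(arg: str) -> tuple:
    """Hypothetical lenient parser: '0.2' -> (0.2, 0.15, 0.95),
    '0.2,0.1' -> (0.2, 0.1, 0.95), full triple passes through unchanged."""
    parts = tuple(float(p) for p in arg.split(","))
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected 1 to 3 comma-separated values")
    # Fill missing trailing values from the defaults.
    return parts + DEFAULTS[len(parts):]
```

This keeps the single-flag interface while letting users touch only the threshold, which is the common case mentioned above.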
> I think it would be simpler if `--easycache threshold,start_percent,end_percent` is split into 3 or 4 flags. You generally just want to enable the safe default OR only modify `threshold`. Or maybe make start and end optional? Not sure.
Thanks for the suggestion! I think we can just make everything optional, but the quality seems to vary depending on the model/quant type, so this feature is definitely for "advanced" users, I'd say. I'll come back to this PR this weekend.
It seems the results for the WAN video model are not very good. I'm not sure if something is wrong somewhere.
Without easycache
```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 2.22s/it
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 44.32s
```
https://github.com/user-attachments/assets/9c7ad71b-7264-4765-8a7f-5f17814b33c5
With `--easycache 0.2,0.15,0.95`

```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.200, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 1.64s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 10/20 steps (2.00x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 32.90s
```
https://github.com/user-attachments/assets/30c32867-04ab-4320-9f3e-5cbd21f59ec8
With `--easycache 0.05,0.15,0.95`

```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.050, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 1.82s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 7/20 steps (1.54x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 36.40s
```
https://github.com/user-attachments/assets/b4a75cc7-7680-48d7-b9cf-946e91f2bbd8
The values 0.025,0.15,0.95 worked pretty well with wan2.2 5B. Those are also the defaults in the original implementation. I will update and rerun that later.
edit: see https://github.com/leejet/stable-diffusion.cpp/issues/943
Reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one as before; CFG pairs should now stay in sync.
@Green-Sky
results for 0.2,0.15,0.95 now
@leejet
Results on wan2.1 1.3B
Without easycache
https://github.com/user-attachments/assets/06c3188e-b719-42f3-9dc7-1dc44e9f816b
With easycache 0.2,0.15,0.95
https://github.com/user-attachments/assets/ccd387fd-69ca-4acb-9881-80c256d4ef2d
I think this is ready to be merged
> Reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one as before; CFG pairs should now stay in sync.
Great! It now works and looks good in most cases for 0.2, AND the estimates now match. :rocket:
| thresh | img | real speedup |
|---|---|---|
| 0 | | 1.00x (355.82s) |
| 0.025 | | 1.28x (277.47s) |
| 0.1 | | 2.0x (176.18s) |
| 0.2 | | 2.37x (150.00s) |
| 0.3 | | 2.50x (142.11s) |
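As a sanity check on the "real speedup" column: each entry should be the baseline wall time divided by that run's wall time. The snippet below recomputes the column from the times in the table (illustration only, numbers copied from the table above).

```python
# Wall-clock times (seconds) per threshold, from the table above.
times = {0.0: 355.82, 0.025: 277.47, 0.1: 176.18, 0.2: 150.00, 0.3: 142.11}

baseline = times[0.0]
# real speedup = baseline time / run time
speedups = {t: round(baseline / s, 2) for t, s in times.items()}
print(speedups)  # e.g. 0.2 -> 2.37, 0.3 -> 2.5, matching the table
```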
> I think this is ready to be merged
Yes, I think this is ready for review.
Has anyone tested heun yet? Not that you would combine heun and EasyCache, but still.
LGTM. Thank you for your contribution!
Without easycache
https://github.com/user-attachments/assets/a8fd06c0-97bc-4dd3-b8ba-6d0c5f49a951
With easycache
https://github.com/user-attachments/assets/cd8b2ddd-2117-4567-9bf1-5fda24ea462c