feat: add easycache support
This PR adds support for EasyCache, a variant of TeaCache that achieves significant speedup.
Currently tested only with CUDA and on Flux/Qwen.
Command usage:
`--easycache threshold,start_percent,end_percent`
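As a rough illustration of how the comma-separated triple could be parsed, here is a minimal sketch. This is not the actual stable-diffusion.cpp argument parser; the function name and validation rules are assumptions for illustration only.

```python
def parse_easycache(arg: str):
    """Hypothetical parser for 'threshold,start_percent,end_percent'.

    Not the real stable-diffusion.cpp code; a sketch of the expected shape.
    """
    parts = arg.split(",")
    if len(parts) != 3:
        raise ValueError("expected threshold,start_percent,end_percent")
    threshold, start_percent, end_percent = (float(p) for p in parts)
    # Percents are fractions of the sampling schedule, so they should be
    # ordered and lie in [0, 1] (assumed constraint).
    if not (0.0 <= start_percent <= end_percent <= 1.0):
        raise ValueError("percents must satisfy 0 <= start <= end <= 1")
    return threshold, start_percent, end_percent

print(parse_easycache("0.2,0.15,0.95"))  # (0.2, 0.15, 0.95)
```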
Examples:
| Without EasyCache | `--easycache 0.2,0.15,0.95` | Speedup |
|---|---|---|
| | | x1.85 |
| | | x1.85 |
> [!NOTE]
> Outdated
Ran it with a SPARK (preview) Chroma finetune
| thresh | img | real speedup |
|---|---|---|
| 0 | | baseline |
| 0.025 | | 1.12x |
| 0.1 | | 1.31x |
| 0.2 | | 1.4x |
I noticed the estimated speedup is off. It is off by exactly 2x, so I guess CFG is not handled properly yet.
E.g. 40 steps with CFG are actually 80 steps:
40/(40-8) -> 1.25 (estimated 1.25)
80/(80-8) -> 1.11 (measured 1.12)
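The arithmetic above can be written out as a small sketch: the estimate is total model calls divided by calls actually executed, and with CFG each sampling step issues two diffusion-model calls (cond + uncond). The function below is illustrative, not the actual implementation.

```python
def estimated_speedup(steps: int, skipped: int, cfg_enabled: bool) -> float:
    """Speedup estimate = total model calls / calls actually executed.

    With CFG each step runs two diffusion-model calls, so 40 steps
    really mean 80 calls; computing the estimate on steps alone while
    skipping counted calls makes it off by roughly 2x, as observed.
    """
    calls = steps * 2 if cfg_enabled else steps
    return calls / (calls - skipped)

# Numbers from the comment above: 8 skipped calls over a 40-step run.
print(round(estimated_speedup(40, 8, cfg_enabled=False), 2))  # 1.25 (estimated)
print(round(estimated_speedup(40, 8, cfg_enabled=True), 2))   # 1.11 (close to measured 1.12)
```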
I included 0.025 because that is what the original used for Wan.
Anyway, good stuff; I'll take the 11% speedup.
```shell
$ result/bin/sd --diffusion-model models/SPARK.Chroma_preview-q5_k.gguf --t5xxl models/flux-extra/t5xxl_fp16.safetensors -t 8 --vae models/flux-extra/ae-f16.gguf --sampling-method dpm++2m --scheduler simple --steps 40 --cfg-scale 3.8 -n "low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors, noisy, artifacts, fake, generated, overblown, over exposed" -p "Photograph of a lovely cat. Green rolling hills in the background." --clip-on-cpu --offload-to-cpu -v -W 768 -H 1024 --diffusion-fa --chroma-disable-dit-mask --easycache 0.025,0.15,0.95
```
Is this additional noise in the image expected? I haven't noticed it in other TeaCache tests or similar methods.
I speculate that this happens with some models (see the second pic in the OP) when too much is skipped at the end of the sampling phase. You can lower the end_percent value from 0.95 to 0.80 or lower. This will result in fewer skips late in sampling, but also fewer skips overall, so it is not conclusive(!).
I observed similar behavior with beta/smoothstep schedulers, which reduces the noise in those models. Those schedulers spend more time in the early and late timesteps of the sampling (s-curve or gain-function behavior).
edit: Of course this PR can be broken too. Another factor is the quant used, which seems to exacerbate the noise.
edit2: Here is what smoothstep (almost beta) with 20 steps looks like:
I think it would be simpler if `--easycache threshold,start_percent,end_percent` were split into 3 or 4 flags. You generally just want to enable the safe default OR only modify the threshold.
Or maybe make start and end optional? Not sure.
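To sketch what "make start and end optional" could look like: accept 1 to 3 comma-separated values and fall back to defaults for the rest. The defaults below (0.025, 0.15, 0.95) are the values this thread attributes to the original implementation; treat both them and the function shape as assumptions.

```python
# Assumed defaults, taken from the discussion in this thread.
DEFAULTS = (0.025, 0.15, 0.95)

def parse_easycache_optional(arg: str) -> tuple:
    """Hypothetical lenient parser: '0.2' -> (0.2, 0.15, 0.95),
    '0.2,0.1' -> (0.2, 0.1, 0.95), full triple passes through unchanged."""
    parts = tuple(float(p) for p in arg.split(","))
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected 1 to 3 comma-separated values")
    # Fill missing trailing values from the defaults.
    return parts + DEFAULTS[len(parts):]
```

This keeps the single-flag interface while letting users touch only the threshold, which is the common case mentioned above.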
> I think it would be simpler if `--easycache threshold,start_percent,end_percent` is split into 3 or 4 flags. You generally just want to enable the safe default OR only modify `threshold`. Or maybe make start and end optional? Not sure.
Thanks for the suggestion! I think we can just make everything optional, but the quality seems to vary depending on the model/quant type, so this feature is definitely for "advanced" users, I'd say. I'll come back to this PR this weekend.
It seems the results for the WAN video model are not very good. I'm not sure if something is wrong somewhere.
Without easycache
```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 2.22s/it
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 44.32s
```
https://github.com/user-attachments/assets/9c7ad71b-7264-4765-8a7f-5f17814b33c5
With `--easycache 0.2,0.15,0.95`

```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.200, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 1.64s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 10/20 steps (2.00x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 32.90s
```
https://github.com/user-attachments/assets/30c32867-04ab-4320-9f3e-5cbd21f59ec8
With `--easycache 0.05,0.15,0.95`

```
[DEBUG] stable-diffusion.cpp:3856 - sample 104x60x9
[INFO ] stable-diffusion.cpp:1778 - EasyCache enabled - threshold: 0.050, start_percent: 0.15, end_percent: 0.95
[DEBUG] ggml_extend.hpp:1656 - Wan2.1-T2V-1.3B compute buffer size: 2575.50 MB(VRAM)
|==================================================| 20/20 - 1.82s/it
[INFO ] stable-diffusion.cpp:2065 - EasyCache skipped 7/20 steps (1.54x estimated speedup)
[INFO ] stable-diffusion.cpp:3884 - sampling completed, taking 36.40s
```
https://github.com/user-attachments/assets/b4a75cc7-7680-48d7-b9cf-946e91f2bbd8
The values 0.025,0.15,0.95 worked pretty well with wan2.2 5B. Those are also the defaults in the original implementation. I will update and rerun that later.
edit: see https://github.com/leejet/stable-diffusion.cpp/issues/943
Reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one as before; CFG pairs should now stay in sync.
@Green-Sky
results for 0.2,0.15,0.95 now
@leejet
Results on wan2.1 1.3B
Without easycache
https://github.com/user-attachments/assets/06c3188e-b719-42f3-9dc7-1dc44e9f816b
With easycache 0.2,0.15,0.95
https://github.com/user-attachments/assets/ccd387fd-69ca-4acb-9881-80c256d4ef2d
I think this is ready to be merged
> Reworked the sampling loop so EasyCache participates in every diffusion-model call instead of just the first one as before; CFG pairs should now stay in sync.
Great! It now works and looks good in most cases for 0.2, AND the estimates now match. :rocket:
| thresh | img | real speedup |
|---|---|---|
| 0 | | 1.00x (355.82s) |
| 0.025 | | 1.28x (277.47s) |
| 0.1 | | 2.0x (176.18s) |
| 0.2 | | 2.37x (150.00s) |
| 0.3 | | 2.50x (142.11s) |
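As a sanity check on the "real speedup" column: each entry should be the baseline wall time divided by that run's wall time. The snippet below recomputes the column from the times in the table (illustration only, numbers copied from the table above).

```python
# Wall-clock times (seconds) per threshold, from the table above.
times = {0.0: 355.82, 0.025: 277.47, 0.1: 176.18, 0.2: 150.00, 0.3: 142.11}

baseline = times[0.0]
# real speedup = baseline time / run time
speedups = {t: round(baseline / s, 2) for t, s in times.items()}
print(speedups)  # e.g. 0.2 -> 2.37, 0.3 -> 2.5, matching the table
```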
> I think this is ready to be merged
Yes, I think this is ready for review.
Has anyone tested heun yet? Not that you would combine heun and EasyCache, but still.
LGTM. Thank you for your contribution!
Without easycache
https://github.com/user-attachments/assets/a8fd06c0-97bc-4dd3-b8ba-6d0c5f49a951
With easycache
https://github.com/user-attachments/assets/cd8b2ddd-2117-4567-9bf1-5fda24ea462c