Please add support for Segmind distilled diffusion models
I found ckpt versions of the Segmind distilled diffusion models (https://github.com/segmind/distill-sd, https://huggingface.co/segmind):
https://huggingface.co/ClashSAN/small-sd/resolve/main/smallSDdistilled.ckpt
https://huggingface.co/ClashSAN/small-sd/resolve/main/tinySDdistilled.ckpt
I ran the convert.py script from your repo to make a ggml f32 conversion of tinySDdistilled.ckpt. Then I tried to load the generated ggml file in stable-diffusion.cpp, but got this error:
[ERROR] stable-diffusion.cpp:2898 - tensor 'model.diffusion_model.output_blocks.1.0.in_layers.0.weight' has wrong shape in model file: got [1920, 1, 1, 1], expected [2560, 1, 1, 1]
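For what it's worth, the mismatch is visible directly in the checkpoint: the Segmind distilled UNets prune blocks, so the channel counts at the skip connections no longer match what stable-diffusion.cpp expects for a standard SD 1.x UNet. A minimal inspection sketch, assuming torch is installed and the tinySDdistilled.ckpt file from above is in the current directory:

```python
# Inspect the distilled checkpoint to see why the shapes don't line up.
# Assumes: torch installed, tinySDdistilled.ckpt downloaded from the link above.
import torch

ckpt = torch.load("tinySDdistilled.ckpt", map_location="cpu")
sd = ckpt.get("state_dict", ckpt)

name = "model.diffusion_model.output_blocks.1.0.in_layers.0.weight"
print(name, tuple(sd[name].shape))  # distilled model: (1920,) instead of SD 1.x's (2560,)

# Count UNet tensors to get a feel for how much of the architecture was pruned
unet_keys = [k for k in sd if k.startswith("model.diffusion_model.")]
print(len(unet_keys), "UNet tensors")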
@JohnClaw Actually, I think that to get a really small memory footprint, we can instead quantize additional layers of the already well-trained models that are available. It takes far less effort, and we already get good results with the dynamic shapes that are supported. When you use that distilled model to generate portraits, there is a very high chance the faces are duplicated, meaning it can only do square images well at the moment.
The models at 4-bit (around 22% of the original size) already give good results: the output is no different from f16 at small image sizes.
With ONNX int8 models, an int8-quantized model running in ONNX at 256x192 only takes 0.7 GB. However, the aggressive quantization sometimes produces ugly pictures:
(sample image: 256x384)
Also, I don't think ONNX supports mixed 4-bit quantized models, or supports the 32-bit ARM architecture well, which makes this project a better option for extremely small CPU generations.
Looking at the results produced by stable-diffusion.cpp, they are already pretty good for a sticker generator or profile-picture generation; maybe with a pixelization option it could fit in an RPG game.
(sample images: 128x192, 14 steps, q4_0, non-cherrypicked)
It's probably better to tune quantization settings for existing models than to work with distilled models, because they will achieve very similar things.
If it ever goes below 1 GB, then since this compiles successfully for 32-bit, tiny pictures could probably be generated on any digital device in existence: old Android phones, smart TV boxes, apps, home assistants, whatever you can dream of!
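For a rough sense of whether sub-1 GB is realistic for the weights alone, here is a back-of-the-envelope estimate. It assumes the commonly cited SD 1.x parameter counts (~860M UNet, ~123M CLIP text encoder, ~84M VAE) and ggml's q4_0 layout of roughly 4.5 bits per weight; it is an illustration, not what stable-diffusion.cpp actually allocates:

```python
# Rough weight-memory estimate for an SD 1.x checkpoint with a q4_0 UNet and
# f16 text encoder / VAE. Parameter counts are approximate, and the runtime
# also needs compute buffers that grow with image resolution on top of this.

def megabytes(params_millions, bits_per_weight):
    return params_millions * 1e6 * bits_per_weight / 8 / 1e6

unet_mb = megabytes(860, 4.5)   # q4_0: 4-bit weights + fp16 scale per 32-weight block
clip_mb = megabytes(123, 16)    # text encoder kept at f16
vae_mb  = megabytes(84, 16)     # VAE kept at f16

print(f"UNet  ~{unet_mb:.0f} MB")    # ~480 MB
print(f"CLIP  ~{clip_mb:.0f} MB")    # ~250 MB
print(f"VAE   ~{vae_mb:.0f} MB")     # ~170 MB
print(f"total ~{unet_mb + clip_mb + vae_mb:.0f} MB of weights")  # just under 1 GB
```

So the weights themselves can land just under 1 GB; what pushes small devices over the edge is the per-resolution compute buffers.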
Due to the structural differences between this model and the original one, using the weights directly could lead to errors. I'll take some time to see if it's necessary to add corresponding support.
Thank you very much.
Could you answer my other question, please? https://github.com/leejet/stable-diffusion.cpp/issues/28#issuecomment-1694486320
I have responded on the corresponding issue.
There is a project, vitoplantamura/OnnxStream, that can run Stable Diffusion on a Raspberry Pi Zero 2 (or in 260 MB of RAM). I wonder if the same method can be applied here. Sorry if I'm being annoying, though.
It's slower than stable-diffusion.cpp, though. On a Ryzen 7 4700U, one step takes 42-43 seconds, while sd.cpp needs only 27-28 seconds.
I second this request. It would be absolutely amazing to have support for distilled models. For TinySD (https://huggingface.co/segmind/tiny-sd), the UNet is 617 MB, which is half of an fp16 q4_0, and it could be even smaller if quantized.
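For a ballpark of how much smaller it could get: assuming the 617 MB figure above is fp16 (2 bytes per weight) and that ggml's q4_0 costs roughly 4.5 bits per weight, the distilled UNet alone would shrink to something in the 170-180 MB range. A quick sketch of that arithmetic (approximations, not exact file sizes):

```python
# Rough q4_0 size for the tiny-sd UNet, assuming the 617 MB figure is fp16.
fp16_unet_mb = 617
params = fp16_unet_mb * 1e6 / 2              # ~308M weights at 2 bytes each
q4_0_bits_per_weight = 4 + 16 / 32           # 4-bit weights + one fp16 scale per 32-weight block
q4_0_unet_mb = params * q4_0_bits_per_weight / 8 / 1e6
print(f"~{q4_0_unet_mb:.0f} MB")             # roughly 174 MB
```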
I faced significant challenges while trying to run sd.cpp through Termux on my Redmi Note 8T with 4 GB of RAM. Even with flash attention, 4-bit quantization, and extensive memory swapping, I couldn't generate an image beyond 64x64; at 128x128 it crashed. I believe the few missing megabytes of memory could be recovered by adding support for these distilled models.
@rmatif Were you able to run the tiny-sd model? Let me know.