Please add support for Segmind distilled diffusion models
I found ckpt versions of the Segmind distilled diffusion models (https://github.com/segmind/distill-sd, https://huggingface.co/segmind):
https://huggingface.co/ClashSAN/small-sd/resolve/main/smallSDdistilled.ckpt
https://huggingface.co/ClashSAN/small-sd/resolve/main/tinySDdistilled.ckpt
I ran the convert.py script from your repo to make a ggml f32 conversion of tinySDdistilled.ckpt. Then I tried to load the generated ggml file in stable-diffusion.cpp, but got this error:
[ERROR] stable-diffusion.cpp:2898 - tensor 'model.diffusion_model.output_blocks.1.0.in_layers.0.weight' has wrong shape in model file: got [1920, 1, 1, 1], expected [2560, 1, 1, 1]
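For what it's worth, the mismatch is visible directly in the checkpoint: the Segmind distilled UNets prune blocks, so the channel counts at the skip connections no longer match what stable-diffusion.cpp expects for a standard SD 1.x UNet. A minimal inspection sketch, assuming torch is installed and the tinySDdistilled.ckpt file from above is in the current directory:

```python
# Inspect the distilled checkpoint to see why the shapes don't line up.
# Assumes: torch installed, tinySDdistilled.ckpt downloaded from the link above.
import torch

ckpt = torch.load("tinySDdistilled.ckpt", map_location="cpu")
sd = ckpt.get("state_dict", ckpt)

name = "model.diffusion_model.output_blocks.1.0.in_layers.0.weight"
print(name, tuple(sd[name].shape))  # distilled model: (1920,) instead of SD 1.x's (2560,)

# Count UNet tensors to get a feel for how much of the architecture was pruned
unet_keys = [k for k in sd if k.startswith("model.diffusion_model.")]
print(len(unet_keys), "UNet tensors")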
@JohnClaw Actually, I think that to get a really small memory footprint, we can instead quantize additional layers of the already well-trained models that are available. It takes far less effort, and we already get good results with the dynamic shapes that are supported. When you use that distilled model to generate portraits, there is a very high chance the faces are duplicated, meaning it can only do square images well at the moment.
The models at 4-bit (around 22% of the original size) already give good results: the output is no different from f16 at small image sizes.
With ONNX int8 models, an int8-quantized model running in ONNX at 256x192 only takes 0.7 GB. However, the aggressive quantization sometimes produces ugly pictures:
(sample image: 256x384)
Also, I don't think ONNX supports mixed 4-bit quantized models, or supports the 32-bit ARM architecture well, which makes this project a better option for extremely small CPU generations.
Looking at the results produced by stable-diffusion.cpp, they are already pretty good for a sticker generator or profile-picture generation; maybe with a pixelization option it could fit in an RPG game.
(sample images: 128x192, 14 steps, q4_0, non-cherrypicked)
It's probably better to tune quantization settings for existing models than to work with distilled models, because they will achieve very similar things.
If it ever goes below 1 GB, then since this compiles successfully for 32-bit, tiny pictures could probably be generated on any digital device in existence: old Android phones, smart TV boxes, apps, home assistants, whatever you can dream of!
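For a rough sense of whether sub-1 GB is realistic for the weights alone, here is a back-of-the-envelope estimate. It assumes the commonly cited SD 1.x parameter counts (~860M UNet, ~123M CLIP text encoder, ~84M VAE) and ggml's q4_0 layout of roughly 4.5 bits per weight; it is an illustration, not what stable-diffusion.cpp actually allocates:

```python
# Rough weight-memory estimate for an SD 1.x checkpoint with a q4_0 UNet and
# f16 text encoder / VAE. Parameter counts are approximate, and the runtime
# also needs compute buffers that grow with image resolution on top of this.

def megabytes(params_millions, bits_per_weight):
    return params_millions * 1e6 * bits_per_weight / 8 / 1e6

unet_mb = megabytes(860, 4.5)   # q4_0: 4-bit weights + fp16 scale per 32-weight block
clip_mb = megabytes(123, 16)    # text encoder kept at f16
vae_mb  = megabytes(84, 16)     # VAE kept at f16

print(f"UNet  ~{unet_mb:.0f} MB")    # ~480 MB
print(f"CLIP  ~{clip_mb:.0f} MB")    # ~250 MB
print(f"VAE   ~{vae_mb:.0f} MB")     # ~170 MB
print(f"total ~{unet_mb + clip_mb + vae_mb:.0f} MB of weights")  # just under 1 GB
```

So the weights themselves can land just under 1 GB; what pushes small devices over the edge is the per-resolution compute buffers.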
Due to the structural differences between this model and the original one, using the weights directly could lead to errors. I'll take some time to see if it's necessary to add corresponding support.
Thank you very much.
Could you answer my other question, please? https://github.com/leejet/stable-diffusion.cpp/issues/28#issuecomment-1694486320
I have responded on the corresponding issue.
There is a project, vitoplantamura/OnnxStream, that can run Stable Diffusion on a Raspberry Pi Zero 2 (or in 260 MB of RAM). I wonder if the same method can be applied here. Sorry if I'm being annoying, though.
It's slower than stable-diffusion.cpp, though. On a Ryzen 7 4700U, one step takes 42-43 seconds, while sd.cpp needs only 27-28 seconds.
I second this request. It would be absolutely amazing to have support for distilled models. For TinySD (https://huggingface.co/segmind/tiny-sd), the UNet is 617 MB, which is half of an fp16 q4_0, and it could be even smaller if quantized.
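For a ballpark of how much smaller it could get: assuming the 617 MB figure above is fp16 (2 bytes per weight) and that ggml's q4_0 costs roughly 4.5 bits per weight, the distilled UNet alone would shrink to something in the 170-180 MB range. A quick sketch of that arithmetic (approximations, not exact file sizes):

```python
# Rough q4_0 size for the tiny-sd UNet, assuming the 617 MB figure is fp16.
fp16_unet_mb = 617
params = fp16_unet_mb * 1e6 / 2              # ~308M weights at 2 bytes each
q4_0_bits_per_weight = 4 + 16 / 32           # 4-bit weights + one fp16 scale per 32-weight block
q4_0_unet_mb = params * q4_0_bits_per_weight / 8 / 1e6
print(f"~{q4_0_unet_mb:.0f} MB")             # roughly 174 MB
```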
I faced significant challenges while trying to run sd.cpp through Termux on my Redmi Note 8T with 4 GB of RAM. Even with flash attention, 4-bit quantization, and extensive memory swapping, I couldn't generate an image beyond 64x64; at 128x128 it crashed. I believe the few missing megabytes of memory could be recovered by adding support for these distilled models.
@rmatif Were you able to run the tiny-sd model? Let me know.