Support for Sparse MoE models like Camelidae and Sparsetral
A new technique for broadening the capabilities of LLMs without massively increasing resource requirements has recently emerged. Instead of routing tokens between full expert FFNs like a standard MoE, it routes them between expert LoRA adapters attached to a shared base model. This gives much of the benefit of a MoE at very little memory and compute cost beyond the base model itself (see the sketch below). Please consider adding support for inferencing and quantizing these sparse MoEs; they are a very promising branch of LLMs that can be not only run but fully trained on attainable consumer hardware.
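A minimal sketch of the idea, not the actual Camelidae/Sparsetral code: a router picks the top-k LoRA "experts" per token and adds their low-rank deltas to the output of a frozen base projection. All names and hyperparameters here (`LoRAExpert`, `MoELoRALinear`, `num_experts=16`, `top_k=4`, `rank=16`) are illustrative assumptions, not values taken from the released models.

```python
# Illustrative sketch only -- not the released Camelidae/Sparsetral implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: delta(x) = B(A(x)), with A: d_in -> r and B: r -> d_out."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))


class MoELoRALinear(nn.Module):
    """A frozen base projection plus a per-token router over small LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int = 16,
                 top_k: int = 4, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # base weights stay frozen
            p.requires_grad = False
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_in)
        out = self.base(x)                              # dense base path, always computed
        scores = F.softmax(self.router(x), dim=-1)      # (n_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        delta = torch.zeros_like(out)
        for slot in range(self.top_k):                  # accumulate the k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    delta[mask] += weights[mask, slot, None] * expert(x[mask])
        return out + delta
```

With `top_k=4`, each token activates only four small rank-r adapters per layer, so the extra compute and weights are tiny compared with a standard MoE whose experts are full FFN copies.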
Camelidae on HF: https://huggingface.co/hywu
Sparsetral on HF: https://huggingface.co/serpdotai/sparsetral-16x7B-v2
Adding https://github.com/serp-ai/Parameter-Efficient-MoE for reference.
Another sparse MoE implementation: https://github.com/predibase/lorax
They make a lot of claims that are big if true. But who doesn't these days? https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4
That's not the same thing. Sparsetral routes each token through 4 LoRA experts at the same time to improve overall capability.
https://arxiv.org/abs/2401.02731
It's very space- and VRAM-efficient.
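A back-of-the-envelope estimate of why (every value below is an assumption for illustration; the real adapter dims, ranks, expert counts, and which projections carry adapters differ between the Camelidae and Sparsetral configs):

```python
# Rough extra-parameter count for LoRA experts on the FFN of a 7B-class model.
# All numbers are illustrative assumptions.
d_model   = 4096    # hidden size
d_ff      = 14336   # FFN inner size (Mistral-7B-like)
rank      = 16      # assumed LoRA rank
n_experts = 16
n_layers  = 32

# A rank-r adapter on a d_in x d_out projection adds r * (d_in + d_out) weights.
per_expert_per_layer = rank * (d_model + d_ff) + rank * (d_ff + d_model)  # up + down proj
extra = per_expert_per_layer * n_experts * n_layers
print(f"~{extra / 1e6:.0f}M extra parameters on top of ~7B")  # -> ~302M, a few percent
```

Whatever the exact numbers, the point stands: the experts are small low-rank matrices rather than 16 full FFN copies, so the footprint stays close to the single base model instead of a Mixtral-style MoE.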
@ggerganov Just a quick question. What do you think of this type of MoE? How much effort would be needed to implement it? It looks like an efficient way to improve existing models.
I suppose it's not difficult to add support