Support for Sparse MoE models like Camelidae and Sparsetral
A new technique for broadening the capabilities of LLMs without massively increasing resource requirements has recently emerged. Instead of routing tokens between full expert FFNs like a standard MoE, it routes them between expert LoRA adapters attached to a shared base model. This gives much of the benefit of a MoE at very little memory and compute cost beyond the base model itself (see the sketch below). Please consider adding support for inferencing and quantizing these sparse MoEs; they are a very promising branch of LLMs that can be not only run but fully trained on attainable consumer hardware.
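A minimal sketch of the idea, not the actual Camelidae/Sparsetral code: a router picks the top-k LoRA "experts" per token and adds their low-rank deltas to the output of a frozen base projection. All names and hyperparameters here (`LoRAExpert`, `MoELoRALinear`, `num_experts=16`, `top_k=4`, `rank=16`) are illustrative assumptions, not values taken from the released models.

```python
# Illustrative sketch only -- not the released Camelidae/Sparsetral implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: delta(x) = B(A(x)), with A: d_in -> r and B: r -> d_out."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))


class MoELoRALinear(nn.Module):
    """A frozen base projection plus a per-token router over small LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int = 16,
                 top_k: int = 4, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # base weights stay frozen
            p.requires_grad = False
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_in)
        out = self.base(x)                              # dense base path, always computed
        scores = F.softmax(self.router(x), dim=-1)      # (n_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        delta = torch.zeros_like(out)
        for slot in range(self.top_k):                  # accumulate the k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    delta[mask] += weights[mask, slot, None] * expert(x[mask])
        return out + delta
```

With `top_k=4`, each token activates only four small rank-r adapters per layer, so the extra compute and weights are tiny compared with a standard MoE whose experts are full FFN copies.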
Camelidae on HF: https://huggingface.co/hywu
Sparsetral on HF: https://huggingface.co/serpdotai/sparsetral-16x7B-v2
Adding https://github.com/serp-ai/Parameter-Efficient-MoE for reference.
Another sparse MoE implementation: https://github.com/predibase/lorax
They make a lot of claims that are big if true. But who doesn't these days? https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4
That's not the same thing. Sparsetral routes each token through 4 LoRA experts at the same time to improve overall capability.
https://arxiv.org/abs/2401.02731
It's very space- and VRAM-efficient.
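A back-of-the-envelope estimate of why (every value below is an assumption for illustration; the real adapter dims, ranks, expert counts, and which projections carry adapters differ between the Camelidae and Sparsetral configs):

```python
# Rough extra-parameter count for LoRA experts on the FFN of a 7B-class model.
# All numbers are illustrative assumptions.
d_model   = 4096    # hidden size
d_ff      = 14336   # FFN inner size (Mistral-7B-like)
rank      = 16      # assumed LoRA rank
n_experts = 16
n_layers  = 32

# A rank-r adapter on a d_in x d_out projection adds r * (d_in + d_out) weights.
per_expert_per_layer = rank * (d_model + d_ff) + rank * (d_ff + d_model)  # up + down proj
extra = per_expert_per_layer * n_experts * n_layers
print(f"~{extra / 1e6:.0f}M extra parameters on top of ~7B")  # -> ~302M, a few percent
```

Whatever the exact numbers, the point stands: the experts are small low-rank matrices rather than 16 full FFN copies, so the footprint stays close to the single base model instead of a Mixtral-style MoE.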
@ggerganov Just a quick question. What do you think of this type of MoE? How much effort would be needed to implement it? It looks like an efficient way to improve existing models.
I suppose it's not difficult to add support