
[Feature Request] Mixtral Offloading

Open shixianc opened this issue 2 years ago • 2 comments

There's a new caching technique described in the paper https://arxiv.org/abs/2312.17238 (github: https://github.com/dvmazur/mixtral-offloading). They introduce an LRU cache that keeps experts on the GPU based on activation patterns they observed, and also speculatively guess which experts the next layer will need so those can be pre-loaded before its computation starts. The results look quite promising. Can we support this for Mixtral? It would help a lot for running on smaller GPUs.
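For illustration, here is a minimal sketch of the LRU-cache-plus-speculative-prefetch idea. All class and function names are hypothetical stand-ins (not the paper's or TRT-LLM's API); the load/unload callbacks are where the real host-to-device copies and frees would go.

```python
# Minimal sketch of an LRU expert cache with speculative prefetch.
# All names here are illustrative; they are not the TRT-LLM or
# mixtral-offloading APIs.
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident on the GPU; evicts the LRU one."""

    def __init__(self, capacity, load_fn, unload_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # e.g. copies expert weights host -> device
        self.unload_fn = unload_fn    # e.g. frees / offloads expert weights
        self._cache = OrderedDict()   # (layer, expert_id) -> device weights

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self._cache:
            self._cache.move_to_end(key)          # mark as most recently used
            return self._cache[key]
        return self._admit(key)

    def prefetch(self, layer, expert_ids):
        """Speculatively load experts predicted for an upcoming layer."""
        for expert_id in expert_ids:
            key = (layer, expert_id)
            if key not in self._cache:
                self._admit(key)

    def _admit(self, key):
        if len(self._cache) >= self.capacity:
            old_key, old_weights = self._cache.popitem(last=False)  # evict LRU
            self.unload_fn(old_key, old_weights)
        weights = self.load_fn(key)
        self._cache[key] = weights
        return weights


# Usage sketch: the router picks experts for layer L, and a speculative guess
# for layer L+1's experts is prefetched so the copy can overlap layer L's compute.
cache = ExpertLRUCache(
    capacity=4,
    load_fn=lambda key: f"weights-of-{key}",   # stand-in for an H2D weight copy
    unload_fn=lambda key, w: None,             # stand-in for freeing GPU memory
)
active = [cache.get(layer=0, expert_id=e) for e in (1, 5)]
cache.prefetch(layer=1, expert_ids=(2, 7))     # speculative guess for the next layer
```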

shixianc avatar Jan 09 '24 17:01 shixianc

Thanks for highlighting this - it's a very good suggestion for saving memory. We'll evaluate what it would take to support it in TRT-LLM.

ncomly-nvidia avatar Jan 22 '24 21:01 ncomly-nvidia

You may also want to consider MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that implements activation-aware expert offloading. :)

shiqingzhangCSU avatar Apr 26 '24 06:04 shiqingzhangCSU