DeepSeek MoE support
This PR adds support for DeepSeek MoE https://huggingface.co/deepseek-ai/deepseek-moe-16b-base
Main differences from Mixtral (a rough sketch of the resulting MoE block follows the list):
- Shared experts
- First layers are dense
- MoE normalization disabled
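
For illustration only, here is a minimal PyTorch-style sketch of how such an MoE block differs from Mixtral. The class and member names are assumptions, not the TRT-LLM implementation, and the dense first layers (the model's `first_k_dense_replace` setting) would simply use an ordinary MLP instead of this block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepseekMoeBlockSketch(nn.Module):
    """Illustrative sketch only; names and shapes are assumptions, not the TRT-LLM code."""

    def __init__(self, experts, shared_experts, router, top_k):
        super().__init__()
        self.experts = nn.ModuleList(experts)   # routed experts
        self.shared_experts = shared_experts    # always-active MLP; Mixtral has none
        self.router = router                    # e.g. nn.Linear(hidden_size, num_experts)
        self.top_k = top_k

    def forward(self, hidden_states):           # [num_tokens, hidden_size]
        probs = F.softmax(self.router(hidden_states), dim=-1)
        weights, selected = torch.topk(probs, self.top_k, dim=-1)
        # Unlike Mixtral, the top-k weights are NOT renormalized to sum to 1
        # (MoE normalization disabled).
        routed = torch.zeros_like(hidden_states)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, k] == e
                if mask.any():
                    routed[mask] += weights[mask, k].unsqueeze(-1) * expert(hidden_states[mask])
        # Shared experts see every token; their output is added to the routed output.
        return routed + self.shared_experts(hidden_states)
```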
Build:

```sh
cd TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /models/deepseek-moe-16b-base/ --dtype float16 --output_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/
trtllm-build --checkpoint_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/ --output_dir /trtllm/deepseek-moe-16b-base/1-gpu --max_batch_size 32 --max_input_len 3072 --max_output_len 1024 --max_num_tokens 32768 --gpt_attention_plugin float16 --gemm_plugin float16 --context_fmha enable --paged_kv_cache enable --remove_input_padding enable --use_paged_context_fmha enable
```
Run:

```sh
cd TensorRT-LLM/examples/
python run.py --engine_dir /trtllm/deepseek-moe-16b-base/1-gpu --tokenizer_dir /models/deepseek-moe-16b-base/ --max_output_len 32 --top_p 0 --input_text "The president of the United States is person who"
```
TensorRT-LLM Output:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin▁of▁sentence|>The president of the United States is person who"
Output [Text 0 Beam 0]: " is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander"
Transformers Output:
>>> tokenizer.batch_decode(model.generate(torch.LongTensor([tokenizer.encode("The president of the United States is person who")]).cuda(), max_new_tokens=32, do_sample=False))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
['<|begin▁of▁sentence|>The president of the United States is person who is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander']
Thanks @akhoroshev for your contribution to TRT-LLM. My suggestion is to use a dedicated model definition for the newly added MoE models instead of reusing the llama model. We do have a plan to create unique mixtral and arctic examples in the coming release.
I'm not sure whether such an effort is acceptable to you or not. If you're not willing to refactor the code in this way, we can do that later, after this MR is merged.
Hi @akhoroshev, first off thanks for the contribution. I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.
My second comment is that we have done some work for other shared experts and settled on a slightly different convention for the shared expert design. Instead of modifying the MOE plugin we instead use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss with others working on this and decide how best to unify the design with what you have here.
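
For context, a minimal sketch of that convention under assumed names (hypothetical class and members, not the actual TRT-LLM code): the decoder layer composes an unmodified MOE with a plain MLP for the shared experts and sums their outputs.

```python
import torch.nn as nn


class MoeWithSharedExperts(nn.Module):
    """Hypothetical sketch of the decoder-layer-level composition described above."""

    def __init__(self, moe: nn.Module, shared_mlp: nn.Module):
        super().__init__()
        self.moe = moe                # unmodified MOE layer (routed experts only)
        self.shared_mlp = shared_mlp  # ordinary MLP sized for the shared experts

    def forward(self, hidden_states):
        # Shared experts process every token; the routed MOE output is added on top.
        return self.moe(hidden_states) + self.shared_mlp(hidden_states)
```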
My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.
Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here
I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.
I agree that DeepSeek and other MoE architectures need a separate folder, and I think the trtllm team will be able to do the job after this MR for example.
My second comment is that we have done some work for other shared experts and settled on a slightly different convention for the shared expert design. Instead of modifying the MOE plugin we instead use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss with others working on this and decide how best to unify the design with what you have here.
Ok, I'll wait for the results of the discussion
My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.
I agree that the `is_moe_layer` function is better. But what about the `dense_intermediate_size` param? Is it OK, or do we need a more general solution?
Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here
I'm interested
I agree that the `is_moe_layer` function is better. But what about the `dense_intermediate_size` param? Is it OK, or do we need a more general solution?
This is a good question. Perhaps a list of `DenseConfig` and `MOEConfig` options would be the best approach, since it can store the details of the config for each layer. Then we can have two functions: `is_moe_layer(layer_idx) -> bool` and a corresponding `get_layer_config(layer_idx) -> MOEConfig|DenseConfig`. Open to suggestions if you have any other ideas.
```python
# Sketch: imports, a placeholder enum, and method bodies filled in for illustration.
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Union


class ParallelismMode(IntEnum):
    # Placeholder standing in for the existing TRT-LLM MoE parallelism-mode enum.
    NONE = 0
    EXPERT_PARALLEL = 1
    TENSOR_PARALLEL = 2


@dataclass
class DenseConfig:
    intermediate_size: int
    hidden_act: str


@dataclass
class MoeConfig:
    num_experts: int
    top_k: int
    tp_mode: ParallelismMode
    num_shared_experts: int
    intermediate_size: int
    hidden_act: str


@dataclass
class LayersMLPConfig:
    # Either one config shared by all layers, or one entry per layer.
    config: Union[DenseConfig, MoeConfig, List[Union[DenseConfig, MoeConfig]]]

    def get_layer_config(self, layer_idx: int) -> Union[DenseConfig, MoeConfig]:
        if isinstance(self.config, list):
            return self.config[layer_idx]
        return self.config

    def is_moe_layer(self, layer_idx: int) -> bool:
        return isinstance(self.get_layer_config(layer_idx), MoeConfig)

    def is_dense_layer(self, layer_idx: int) -> bool:
        return not self.is_moe_layer(layer_idx)
```
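For illustration, a DeepSeek-style layout (first layer dense, remaining layers MoE) could then be expressed as below; the values are hypothetical and not taken from the real deepseek-moe-16b config:

```python
# Hypothetical example values, for illustration only.
dense = DenseConfig(intermediate_size=10944, hidden_act="silu")
moe = MoeConfig(num_experts=64, top_k=6, tp_mode=ParallelismMode.TENSOR_PARALLEL,
                num_shared_experts=2, intermediate_size=1408, hidden_act="silu")

num_layers = 28
layers = LayersMLPConfig(config=[dense] + [moe] * (num_layers - 1))

assert layers.is_dense_layer(0)
assert layers.is_moe_layer(1)
```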
@djns99 I wrote the classes for your solution above. I want to extend the existing `MoeConfig` with `intermediate_size` and `hidden_act` members, and also introduce the new `DenseConfig` and `LayersMLPConfig` classes. What do you think about it?
Thanks @akhoroshev that makes perfect sense to me. Feel free to make that change to this PR if you would like
I discussed the shared-experts question, and the verdict was that we should implement a `SharedExpertsMOE`-type class that handles this, so we can keep the MOE class simple while also having one shared implementation. You won't have to do anything here yet; we will get that integrated internally in the next week or so.
Any progress update on this one?
@Ahmad-Magdy-Osman yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time
@dominicshanshan Is this available on the main branch or somewhere else? I can try building the docker image and experiment with it
Hi @Ahmad-Magdy-Osman, currently these changes are being tested on our internal branch. Once they are accepted internally they will be released in one of our upcoming weekly releases. We will let you know as soon as they are available
@dominicshanshan
Hello, first of all thanks for your help with this PR. I'm too busy to do this right now.
yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time
Will DeepSeek V1 architecture be supported?
yes
Any recent updates regarding the support for DeepSeek MOE?
DeepSeek MoE will hopefully appear in the main branch next week; DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month. We are still working hard to improve the MLA kernel performance.
Is it already supported?
Any Updates?
DeepSeek V1 is ready to go and should appear in the main branch early next week. V2 we are still tuning; we are targeting performance close to what the V2 model paper demonstrates.
Are there any recent releases? Deepseek v2 is exciting and I can't wait to try it out on trtllm.☺️
@akhoroshev, deepseek-v1 is live in the main branch now; deepseek-v2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution!
You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!
nice work!
Does model opt support fp8 quant for Deepseek v1?
@dominicshanshan
yep, already implemented and passed CI check, should appear in main branch soon
Little update: we cannot get deepseek-v2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.
Hi @dominicshanshan, when will it be released?
I will let you know the status on Friday. We found a precision issue when converting from BF16 -> FP16 and a kernel output mismatch, so some extra work is needed to support a BF16 kernel in the internal acceleration library.
@dominicshanshan Thank you for your reply. I'm looking forward to the updates. Also, I wanted to ask if support for fp8 is planned?
Yes, once we correct the precision issue with the BF16 kernel in MLA, FP8 will be enabled; we have to make sure the BF16 output is aligned with the HF model output first.
@fengyang95, as promised, a little update: we solved the BF16 precision issue and the tested output is now aligned with the HF model. Please bear with us while we spend some time packaging this up and passing the internal CI tests; it will probably need one extra week. I will update the status on Wednesday. Apologies again to community developers for the long wait.
Please, is there any update?