DeepSeek MoE support
This PR adds support for DeepSeek MoE https://huggingface.co/deepseek-ai/deepseek-moe-16b-base
Main differences from Mixtral (a rough sketch of the resulting MoE block follows the list):
- Shared experts
- First layers are dense
- MoE normalization disabled
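
For illustration only, here is a minimal PyTorch-style sketch of how such an MoE block differs from Mixtral. The class and member names are assumptions, not the TRT-LLM implementation, and the dense first layers (the model's `first_k_dense_replace` setting) would simply use an ordinary MLP instead of this block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepseekMoeBlockSketch(nn.Module):
    """Illustrative sketch only; names and shapes are assumptions, not the TRT-LLM code."""

    def __init__(self, experts, shared_experts, router, top_k):
        super().__init__()
        self.experts = nn.ModuleList(experts)   # routed experts
        self.shared_experts = shared_experts    # always-active MLP; Mixtral has none
        self.router = router                    # e.g. nn.Linear(hidden_size, num_experts)
        self.top_k = top_k

    def forward(self, hidden_states):           # [num_tokens, hidden_size]
        probs = F.softmax(self.router(hidden_states), dim=-1)
        weights, selected = torch.topk(probs, self.top_k, dim=-1)
        # Unlike Mixtral, the top-k weights are NOT renormalized to sum to 1
        # (MoE normalization disabled).
        routed = torch.zeros_like(hidden_states)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, k] == e
                if mask.any():
                    routed[mask] += weights[mask, k].unsqueeze(-1) * expert(hidden_states[mask])
        # Shared experts see every token; their output is added to the routed output.
        return routed + self.shared_experts(hidden_states)
```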
Build:

```sh
cd TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /models/deepseek-moe-16b-base/ --dtype float16 --output_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/
trtllm-build --checkpoint_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/ --output_dir /trtllm/deepseek-moe-16b-base/1-gpu --max_batch_size 32 --max_input_len 3072 --max_output_len 1024 --max_num_tokens 32768 --gpt_attention_plugin float16 --gemm_plugin float16 --context_fmha enable --paged_kv_cache enable --remove_input_padding enable --use_paged_context_fmha enable
```
Run:

```sh
cd TensorRT-LLM/examples/
python run.py --engine_dir /trtllm/deepseek-moe-16b-base/1-gpu --tokenizer_dir /models/deepseek-moe-16b-base/ --max_output_len 32 --top_p 0 --input_text "The president of the United States is person who"
```
TensorRT-LLM Output:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin▁of▁sentence|>The president of the United States is person who"
Output [Text 0 Beam 0]: " is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander"
Transformers Output:
>>> tokenizer.batch_decode(model.generate(torch.LongTensor([tokenizer.encode("The president of the United States is person who")]).cuda(), max_new_tokens=32, do_sample=False))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
['<|begin▁of▁sentence|>The president of the United States is person who is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander']
Thanks @akhoroshev for your contribution to TRT-LLM. My suggestion is to use a dedicated model definition for the newly added MoE models instead of reusing the llama model. We do have a plan to create unique mixtral and arctic examples in the coming release.
I'm not sure whether such an effort is acceptable to you or not. If you're not willing to refactor the code in this way, we can do that later, after this MR is merged.
Hi @akhoroshev, first off thanks for the contribution. I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.
My second comment is that we have done some work for other shared experts and settled on a slightly different convention for the shared expert design. Instead of modifying the MOE plugin we instead use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss with others working on this and decide how best to unify the design with what you have here.
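
For context, a minimal sketch of that convention under assumed names (hypothetical class and members, not the actual TRT-LLM code): the decoder layer composes an unmodified MOE with a plain MLP for the shared experts and sums their outputs.

```python
import torch.nn as nn


class MoeWithSharedExperts(nn.Module):
    """Hypothetical sketch of the decoder-layer-level composition described above."""

    def __init__(self, moe: nn.Module, shared_mlp: nn.Module):
        super().__init__()
        self.moe = moe                # unmodified MOE layer (routed experts only)
        self.shared_mlp = shared_mlp  # ordinary MLP sized for the shared experts

    def forward(self, hidden_states):
        # Shared experts process every token; the routed MOE output is added on top.
        return self.moe(hidden_states) + self.shared_mlp(hidden_states)
```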
My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.
Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here
I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.
I agree that DeepSeek and other MoE architectures need a separate folder, and I think the trtllm team will be able to do the job after this MR for example.
My second comment is that we have done some work for other shared experts and settled on a slightly different convention for the shared expert design. Instead of modifying the MOE plugin we instead use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss with others working on this and decide how best to unify the design with what you have here.
Ok, I'll wait for the results of the discussion
My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.
I agree that the `is_moe_layer` function is better. But what about the `dense_intermediate_size` param? Is it OK, or do we need a more general solution?
Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here
I'm interested
I agree that the `is_moe_layer` function is better. But what about the `dense_intermediate_size` param? Is it OK, or do we need a more general solution?
This is a good question. Perhaps a list of `DenseConfig` and `MOEConfig` options would be the best approach, since it can store the details of the config for each layer. Then we can have two functions: `is_moe_layer(layer_idx) -> bool` and a corresponding `get_layer_config(layer_idx) -> MOEConfig|DenseConfig`. Open to suggestions if you have any other ideas.
```python
# Sketch: imports, a placeholder enum, and method bodies filled in for illustration.
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Union


class ParallelismMode(IntEnum):
    # Placeholder standing in for the existing TRT-LLM MoE parallelism-mode enum.
    NONE = 0
    EXPERT_PARALLEL = 1
    TENSOR_PARALLEL = 2


@dataclass
class DenseConfig:
    intermediate_size: int
    hidden_act: str


@dataclass
class MoeConfig:
    num_experts: int
    top_k: int
    tp_mode: ParallelismMode
    num_shared_experts: int
    intermediate_size: int
    hidden_act: str


@dataclass
class LayersMLPConfig:
    # Either one config shared by all layers, or one entry per layer.
    config: Union[DenseConfig, MoeConfig, List[Union[DenseConfig, MoeConfig]]]

    def get_layer_config(self, layer_idx: int) -> Union[DenseConfig, MoeConfig]:
        if isinstance(self.config, list):
            return self.config[layer_idx]
        return self.config

    def is_moe_layer(self, layer_idx: int) -> bool:
        return isinstance(self.get_layer_config(layer_idx), MoeConfig)

    def is_dense_layer(self, layer_idx: int) -> bool:
        return not self.is_moe_layer(layer_idx)
```
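For illustration, a DeepSeek-style layout (first layer dense, remaining layers MoE) could then be expressed as below; the values are hypothetical and not taken from the real deepseek-moe-16b config:

```python
# Hypothetical example values, for illustration only.
dense = DenseConfig(intermediate_size=10944, hidden_act="silu")
moe = MoeConfig(num_experts=64, top_k=6, tp_mode=ParallelismMode.TENSOR_PARALLEL,
                num_shared_experts=2, intermediate_size=1408, hidden_act="silu")

num_layers = 28
layers = LayersMLPConfig(config=[dense] + [moe] * (num_layers - 1))

assert layers.is_dense_layer(0)
assert layers.is_moe_layer(1)
```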
@djns99 I wrote the classes for your solution above. I want to extend the existing `MoeConfig` with `intermediate_size` and `hidden_act` members, and also introduce the new `DenseConfig` and `LayersMLPConfig` classes. What do you think about it?
Thanks @akhoroshev that makes perfect sense to me. Feel free to make that change to this PR if you would like
I discussed the shared-experts question, and the verdict was that we should implement a `SharedExpertsMOE`-type class that handles this, so we can keep the MOE class simple while also having one shared implementation. You won't have to do anything here yet; we will get that integrated internally in the next week or so.
Any progress update on this one?
@Ahmad-Magdy-Osman yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time
@dominicshanshan Is this available on the main branch or somewhere else? I can try building the docker image and experiment with it
Hi @Ahmad-Magdy-Osman, currently these changes are being tested on our internal branch. Once they are accepted internally they will be released in one of our upcoming weekly releases. We will let you know as soon as they are available
@dominicshanshan
Hello, first of all thanks for your help with this PR. I'm too busy to do this right now.
yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time
Will DeepSeek V1 architecture be supported?
yes
Any recent updates regarding the support for DeepSeek MOE?
DeepSeek MoE will hopefully appear in the main branch next week; DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month. We are still working hard to improve the MLA kernel performance.
Is it already supported?
Any Updates?
DeepSeek V1 is ready to go and should appear in the main branch early next week. V2 we are still tuning; we are targeting performance close to what the V2 model paper demonstrates.
Are there any recent releases? Deepseek v2 is exciting and I can't wait to try it out on trtllm.☺️
@akhoroshev, deepseek-v1 is live in the main branch now; deepseek-v2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution!
You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!
nice work!
Does model opt support fp8 quant for Deepseek v1?
@dominicshanshan
yep, already implemented and passed CI check, should appear in main branch soon
Little update: we cannot get deepseek-v2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.
Hi @dominicshanshan, when will it be released?
I will let you know the status on Friday. We found a precision issue when converting from BF16 -> FP16 and a kernel output mismatch, so some extra work is needed to support a BF16 kernel in the internal acceleration library.
@dominicshanshan Thank you for your reply. I'm looking forward to the updates. Also, I wanted to ask if support for fp8 is planned?
Yes, once we correct the precision issue with the BF16 kernel in MLA, FP8 will be enabled; we have to make sure the BF16 output is aligned with the HF model output first.
@fengyang95, as promised, a little update: we solved the BF16 precision issue and the tested output is now aligned with the HF model. Please bear with us while we spend some time packaging this up and passing the internal CI tests; it will probably need one extra week. I will update the status on Wednesday. Apologies again to community developers for the long wait.
Please, is there any update?