
DeepSeek MoE support

Open akhoroshev opened this issue 1 year ago • 14 comments

This PR adds support for DeepSeek MoE https://huggingface.co/deepseek-ai/deepseek-moe-16b-base

Main differences from Mixtral:

  1. Shared experts
  2. First layers are dense
  3. MoE normalization disabled

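In rough pseudocode, the resulting MoE block looks something like the sketch below (a minimal illustration of the three differences; deepseek_moe_block and the argument names are made up here, not the PR's actual API):

import torch

# Minimal sketch of the MoE block described above (illustrative names only).
# Difference 2 (dense first layers) means the first layer(s) use a plain MLP instead of this block.
def deepseek_moe_block(x, router, routed_experts, shared_experts, top_k,
                       normalize_topk=False):
    # x: [num_tokens, hidden]; router(x): [num_tokens, num_experts] logits
    scores = torch.softmax(router(x), dim=-1)
    weights, selected = torch.topk(scores, top_k, dim=-1)
    if normalize_topk:  # Mixtral renormalizes the top-k weights; DeepSeek MoE does not (difference 3)
        weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e, expert in enumerate(routed_experts):
            mask = selected[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
    return out + shared_experts(x)  # shared experts always process every token (difference 1)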

Build:

cd TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /models/deepseek-moe-16b-base/ \
    --dtype float16 --output_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/
trtllm-build --checkpoint_dir /trtllm/deepseek-moe-16b-base/1-gpu-tmp/ \
    --output_dir /trtllm/deepseek-moe-16b-base/1-gpu \
    --max_batch_size 32 --max_input_len 3072 --max_output_len 1024 --max_num_tokens 32768 \
    --gpt_attention_plugin float16 --gemm_plugin float16 \
    --context_fmha enable --paged_kv_cache enable \
    --remove_input_padding enable --use_paged_context_fmha enable

Run:

cd TensorRT-LLM/examples/
python run.py --engine_dir /trtllm/deepseek-moe-16b-base/1-gpu \
    --tokenizer_dir /models/deepseek-moe-16b-base/ \
    --max_output_len 32 --top_p 0 \
    --input_text "The president of the United States is person who"

TensorRT-LLM Output:

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin▁of▁sentence|>The president of the United States is person who"
Output [Text 0 Beam 0]: " is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander"

Transformers Output:

>>> tokenizer.batch_decode(model.generate(torch.LongTensor([tokenizer.encode("The president of the United States is person who")]).cuda(), max_new_tokens=32, do_sample=False))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
['<|begin▁of▁sentence|>The president of the United States is person who is elected by the people of the United States to lead the country. The president is the head of the executive branch of the government. The president is the commander']

akhoroshev avatar Jun 09 '24 18:06 akhoroshev

Thanks @akhoroshev for your contribution to TRT-LLM. My suggestion is to use a dedicated model definition for the newly added MoE models instead of reusing the llama model. We do plan to create dedicated mixtral and arctic examples in the coming release.

I'm not sure whether such an effort is acceptable for you. If you're not willing to refactor the code in this way, we can do that later, after this MR is merged.

nv-guomingz avatar Jun 11 '24 05:06 nv-guomingz

Hi @akhoroshev, first off thanks for the contribution. I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.

My second comment is that we have done some work on other models with shared experts and settled on a slightly different convention for the shared-expert design. Instead of modifying the MOE plugin, we use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss this with others working on it and decide how best to unify the design with what you have here.
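For reference, a minimal Python-style sketch of that convention (SharedExpertDecoderLayer and the member names are illustrative, not the actual TRT-LLM classes):

# Illustrative only: an unmodified MOE plus a plain MLP for the shared experts,
# summed inside the decoder layer.
class SharedExpertDecoderLayer:
    def __init__(self, attention, moe, shared_mlp, input_norm, post_attn_norm):
        self.attention = attention          # standard attention block
        self.moe = moe                      # unmodified MOE layer (routed experts only)
        self.shared_mlp = shared_mlp        # MLP sized for the shared experts
        self.input_norm = input_norm
        self.post_attn_norm = post_attn_norm

    def forward(self, hidden_states):
        hidden_states = hidden_states + self.attention(self.input_norm(hidden_states))
        x = self.post_attn_norm(hidden_states)
        # routed experts go through the plain MOE, shared experts through a separate MLP
        return hidden_states + self.moe(x) + self.shared_mlp(x)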

My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.

Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here

djns99 avatar Jun 11 '24 05:06 djns99

I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after this MR.

I agree that DeepSeek and other MoE architectures need a separate folder; I think the TRT-LLM team could do that after this MR, for example.

My second comment is that we have done some work on other models with shared experts and settled on a slightly different convention for the shared-expert design. Instead of modifying the MOE plugin, we use an unmodified MOE and combine it with an MLP layer for the shared experts at the DecoderLayer level. We are not necessarily committed to one design or the other, so I will discuss this with others working on it and decide how best to unify the design with what you have here.

Ok, I'll wait for the results of the discussion

My final note is that I would like to see a more general version of the DenseReplaceConfig that instead takes a list of layers that are marked as dense or moe, and then have a function is_moe_layer(layer_idx) to check.

I agree that the is_moe_layer function is better. But what about the dense_intermediate_size param? Is it OK as is, or do we need a more general solution?

Please let us know if you are interested in helping with this, otherwise we can look into finding someone to take it forward from here

I'm interested

akhoroshev avatar Jun 11 '24 07:06 akhoroshev

I agree that the is_moe_layer function is better. But what about the dense_intermediate_size param? Is it OK as is, or do we need a more general solution?

This is a good question; perhaps a list of DenseConfig and MOEConfig options that stores the per-layer config details would be the best approach. Then we can have two functions: is_moe_layer(layer_idx) -> bool and a corresponding get_layer_config(layer_idx) -> MOEConfig|DenseConfig. Open to suggestions if you have any other ideas.

djns99 avatar Jun 11 '24 23:06 djns99

I agree that the is_moe_layer function is better. But what about the dense_intermediate_size param? Is it OK as is, or do we need a more general solution?

This is a good question; perhaps a list of DenseConfig and MOEConfig options that stores the per-layer config details would be the best approach. Then we can have two functions: is_moe_layer(layer_idx) -> bool and a corresponding get_layer_config(layer_idx) -> MOEConfig|DenseConfig. Open to suggestions if you have any other ideas.

from __future__ import annotations  # keeps the ParallelismMode annotation lazy

from dataclasses import dataclass
from typing import List, Union


@dataclass
class DenseConfig:
    intermediate_size: int
    hidden_act: str


@dataclass
class MoeConfig:
    num_experts: int
    top_k: int
    tp_mode: ParallelismMode  # the parallelism-mode enum already used by the existing TRT-LLM MoeConfig
    num_shared_experts: int

    intermediate_size: int
    hidden_act: str


@dataclass
class LayersMLPConfig:
    # either a single config shared by all layers, or one entry per layer
    config: Union[DenseConfig, MoeConfig, List[Union[DenseConfig, MoeConfig]]]

    def get_layer_config(self, layer_idx: int) -> Union[DenseConfig, MoeConfig]:
        if isinstance(self.config, list):
            return self.config[layer_idx]
        return self.config

    def is_moe_layer(self, layer_idx: int) -> bool:
        return isinstance(self.get_layer_config(layer_idx), MoeConfig)

    def is_dense_layer(self, layer_idx: int) -> bool:
        return not self.is_moe_layer(layer_idx)
@djns99 I wrote the classes for your solution above. I want to extend the existing MoeConfig with intermediate_size and hidden_act members, and I also want to introduce new DenseConfig and LayersMLPConfig classes. What do you think?
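For illustration, hypothetical usage for a deepseek-moe-16b-style layout (layer 0 dense, the remaining layers MoE; the sizes and counts below are only indicative, take the real values from the HF config):

dense = DenseConfig(intermediate_size=10944, hidden_act='silu')
moe = MoeConfig(num_experts=64, top_k=6,
                tp_mode=None,  # placeholder; in practice the existing TRT-LLM ParallelismMode value
                num_shared_experts=2,
                intermediate_size=1408, hidden_act='silu')
layers = LayersMLPConfig(config=[dense] + [moe] * 27)

assert layers.is_dense_layer(0)
assert layers.is_moe_layer(1)
assert layers.get_layer_config(1).num_experts == 64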

akhoroshev avatar Jun 12 '24 13:06 akhoroshev

Thanks @akhoroshev, that makes perfect sense to me. Feel free to make that change to this PR if you would like.

I discussed the shared-experts question internally, and the verdict was that we should implement a SharedExpertsMOE-type class that handles this, so we can keep the MOE class simple while still having one shared implementation. You won't have to do anything here yet; we will get that integrated internally in the next week or so.

djns99 avatar Jun 13 '24 03:06 djns99

Any progress update on this one?

TheAhmadOsman avatar Jul 06 '24 10:07 TheAhmadOsman

@Ahmad-Magdy-Osman yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time.

dominicshanshan avatar Jul 07 '24 04:07 dominicshanshan

@dominicshanshan Is this available on the main branch or somewhere else? I can try building the docker image and experimenting with it.

TheAhmadOsman avatar Jul 07 '24 06:07 TheAhmadOsman

Hi @Ahmad-Magdy-Osman, currently these changes are being tested on our internal branch. Once they are accepted internally they will be released in one of our upcoming weekly releases. We will let you know as soon as they are available

djns99 avatar Jul 07 '24 21:07 djns99

@dominicshanshan

Hello, first of all thanks for your help with this PR. I'm too busy to do this right now.

yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time.

Will DeepSeek V1 architecture be supported?

akhoroshev avatar Jul 10 '24 14:07 akhoroshev

@dominicshanshan

Hello, first of all thanks for your help with this PR. I'm too busy to do this right now.

yep, we are working on enabling DeepSeek V2 (MoE + MLA) in TRT-LLM v0.12, with performance benchmarking and optimization in progress at the same time.

Will DeepSeek V1 architecture be supported?

yes

dominicshanshan avatar Jul 12 '24 03:07 dominicshanshan

Any recent updates regarding the support for DeepSeek MOE?

halexan avatar Aug 05 '24 04:08 halexan

Any recent updates regarding the support for DeepSeek MOE?

DeepSeek MoE will hopefully appear in the main branch next week, and DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month; we are still working hard to improve the MLA kernel performance.

dominicshanshan avatar Aug 05 '24 14:08 dominicshanshan

Any recent updates regarding the support for DeepSeek MOE?

DeepSeek MoE will hopefully appear in the main branch next week, and DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month; we are still working hard to improve the MLA kernel performance.

Any recent updates regarding the support for DeepSeek MOE?

Xu-Chen avatar Aug 27 '24 08:08 Xu-Chen

Any recent updates regarding the support for DeepSeek MOE?

DeepSeek MoE will hopefully appear in the main branch next week, and DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month; we are still working hard to improve the MLA kernel performance.

Is it already supported?

bobbych94 avatar Aug 27 '24 10:08 bobbych94

Any recent updates regarding the support for DeepSeek MOE?

DeepSeek MoE will hopefully appear in the main branch next week, and DeepSeek V2 (MLA + MoE) will hopefully appear in main at the end of the month; we are still working hard to improve the MLA kernel performance.

Any Updates?

Missmiaom avatar Aug 29 '24 07:08 Missmiaom

DeepSeek V1 is ready to go and should appear in the main branch early next week. We are still tuning V2; we are targeting performance close to what the V2 model paper demonstrates.

dominicshanshan avatar Aug 29 '24 12:08 dominicshanshan

DeepSeek V1 is ready to go and should appear in the main branch early next week. We are still tuning V2; we are targeting performance close to what the V2 model paper demonstrates.

Are there any recent releases? DeepSeek V2 is exciting and I can't wait to try it out on TRT-LLM. ☺️

bobbych94 avatar Sep 12 '24 10:09 bobbych94

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution!

You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

dominicshanshan avatar Sep 25 '24 02:09 dominicshanshan

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution!

You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

nice work!

fengyang95 avatar Sep 27 '24 02:09 fengyang95

Does ModelOpt support FP8 quantization for DeepSeek V1?

@dominicshanshan

akhoroshev avatar Sep 27 '24 09:09 akhoroshev

Does ModelOpt support FP8 quantization for DeepSeek V1?

@dominicshanshan

Yep, it is already implemented and has passed the CI check; it should appear in the main branch soon.
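For anyone who wants to try it once it lands: FP8 in TRT-LLM normally goes through the examples/quantization/quantize.py (ModelOpt) flow, so a hypothetical invocation for this model could look roughly like the lines below (the output path and the applicability of this flow to DeepSeek V1 are assumptions; check the example README once the support is merged):

python examples/quantization/quantize.py --model_dir /models/deepseek-moe-16b-base/ \
    --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --calib_size 512 \
    --output_dir /trtllm/deepseek-moe-16b-base/fp8-ckpt/

followed by trtllm-build on the resulting checkpoint directory, as in the build command near the top of this thread.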

dominicshanshan avatar Sep 29 '24 02:09 dominicshanshan

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution!

You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

A little update: we cannot get DeepSeek-V2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.

dominicshanshan avatar Sep 30 '24 01:09 dominicshanshan

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution! You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

A little update: we cannot get DeepSeek-V2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.

Hi @dominicshanshan, when will it be released?

fengyang95 avatar Oct 08 '24 08:10 fengyang95

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution! You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

A little update: we cannot get DeepSeek-V2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.

Hi @dominicshanshan, when will it be released?

I will let you know the status on Friday. We found a precision issue when converting from BF16 -> FP16 and a kernel output mismatch, so some extra work is needed to support a BF16 kernel in the internal acceleration library.

dominicshanshan avatar Oct 09 '24 10:10 dominicshanshan

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution! You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

A little update: we cannot get DeepSeek-V2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.

Hi @dominicshanshan, when will it be released?

I will let you know the status on Friday. We found a precision issue when converting from BF16 -> FP16 and a kernel output mismatch, so some extra work is needed to support a BF16 kernel in the internal acceleration library.

@dominicshanshan Thank you for your reply. I'm looking forward to the updates. Also, I wanted to ask whether FP8 support is planned.

fengyang95 avatar Oct 10 '24 02:10 fengyang95

@akhoroshev, DeepSeek-V1 is live in the main branch now, and DeepSeek-V2 is targeted to go live in the 10.1 holiday season. Thanks for the community contribution! You can still leave comments in this thread, but since it has fulfilled its purpose I will close it for now. Thanks!

A little update: we cannot get DeepSeek-V2 ready for the main branch in the 10.1 holiday season; we are still working on some bugs found during internal testing and will update the status after the holiday.

Hi @dominicshanshan, when will it be released?

I will let you know the status on Friday. We found a precision issue when converting from BF16 -> FP16 and a kernel output mismatch, so some extra work is needed to support a BF16 kernel in the internal acceleration library.

@dominicshanshan Thank you for your reply. I'm looking forward to the updates. Also, I wanted to ask whether FP8 support is planned.

Yes, once we correct the precision issue with the BF16 kernel in MLA, FP8 will be enabled. We have to make sure the BF16 output is aligned with the HF model output first.

dominicshanshan avatar Oct 10 '24 03:10 dominicshanshan

@fengyang95, as promised, a little update: we solved the BF16 precision issue and the tested output is now aligned with the HF model. Please bear with us while we spend some time packaging things up and passing the internal CI tests; that will probably need one extra week. I will update the status on Wednesday. Apologies again to the community developers who have waited so long.

dominicshanshan avatar Oct 11 '24 10:10 dominicshanshan

@fengyang95, as promised, a little update: we solved the BF16 precision issue and the tested output is now aligned with the HF model. Please bear with us while we spend some time packaging things up and passing the internal CI tests; that will probably need one extra week. I will update the status on Wednesday. Apologies again to the community developers who have waited so long.

Please, is there any update?

WhatGhost avatar Oct 23 '24 03:10 WhatGhost