
[Long seq length] GPT seq length constraint

zhen-jia opened this issue 2 years ago • 14 comments

It seems that the maximum seq length supported for GPT is 4096:

(https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/models/multi_gpu_gpt/ParallelGptDecoderLayerWeight.h#L31)

BERT also seems to have the same maximum seq length.

May I ask the following questions:

  1. Where does the constraint come from, i.e., which kernel?
  2. Do you have plans to support longer seq lengths?

Thanks!

zhen-jia avatar Aug 25 '23 18:08 zhen-jia

Hi, for question 1, there seems to be a constraint in the softmax kernel.

https://github.com/NVIDIA/FasterTransformer/issues/663

lkm2835 avatar Aug 28 '23 04:08 lkm2835

Hello, I see it. I'm trying to modify the code to remove this constraint, but it seems to cause precision problems. Does anyone know the specific reason for this constraint?

template<typename T, typename T_IN>
void invokeMaskedSoftmax(MaskedSoftmaxParam<T, T_IN>& param, cudaStream_t stream)
{
    // attention_score,    (batch_size, head_num, q_length, k_length), softmax output.
    // qk,                 (batch_size, head_num, q_length, k_length), QK^T.
    // attention_mask,     (batch_size, q_length, k_length), attention mask.
    // linear_bias_slopes, (head_num,) the slopes of the linear position bias.

    dim3 grid(param.q_length, param.batch_size, param.num_heads);
    if (param.batch_size * param.num_heads > 360) {
        grid.x = ceil(float(param.q_length) / 32.0f);
    }

    bool is_half2 = sizeof(T) == 2 && sizeof(T_IN) == 2 && param.k_length % 2 == 0;
    dim3 block((param.k_length / (is_half2 ? 2 : 1) + 31) / 32 * 32);
    // Added modification: extra branch (8 items per thread) so block.x up to 8192 is handled
    if (block.x > 4096 && block.x <= 8192) {
        LAUNCH_MAKSED_SOFTMAX(8)
    }
    else if (block.x > 2048 && block.x <= 4096) {
        LAUNCH_MAKSED_SOFTMAX(4)
    }
    else if (block.x > 1024) {
        LAUNCH_MAKSED_SOFTMAX(2)
    }
    else if (block.x > 0) {
        LAUNCH_MAKSED_SOFTMAX(1)
    }
    else {
        FT_CHECK(param.k_length <= 4096);
    }
}

StarrickLiu avatar Aug 28 '23 07:08 StarrickLiu
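For readers following the dispatch above, here is a minimal CUDA sketch, not FasterTransformer's actual kernel, of a masked softmax templated on ITEMS_PER_THREAD. It assumes the LAUNCH_MAKSED_SOFTMAX(n) macro launches roughly block.x / n threads with n columns per thread; the kernel name masked_softmax_sketch, the flattened row indexing, and the simplified one-row mask are all illustrative. Since CUDA caps a block at 1024 threads, one block can cover at most 1024 * ITEMS_PER_THREAD columns of a row, which is why the added LAUNCH_MAKSED_SOFTMAX(8) branch is what lifts the supported block.x from 4096 to 8192.

#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical sketch, not the FasterTransformer kernel. One block handles one
// (batch, head, q) row of the score matrix; each thread covers ITEMS_PER_THREAD
// columns. Illustrative launch: masked_softmax_sketch<4><<<num_rows, 1024>>>(...).
// blockDim.x is assumed to be a power of two <= 1024.
template<int ITEMS_PER_THREAD>
__global__ void masked_softmax_sketch(float* scores, const float* mask, int k_length)
{
    // Flattened (batch, head, q) index; mask indexing is simplified to a single row.
    float* row = scores + (size_t)blockIdx.x * k_length;

    // Load up to ITEMS_PER_THREAD columns per thread and take the local max.
    float local[ITEMS_PER_THREAD];
    float thread_max = -FLT_MAX;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        const int col = threadIdx.x + i * blockDim.x;
        local[i] = (col < k_length && mask[col] > 0.5f) ? row[col] : -FLT_MAX;
        thread_max = fmaxf(thread_max, local[i]);
    }

    // Block-wide max, then block-wide sum, via a simple shared-memory tree reduction.
    __shared__ float red[1024];
    red[threadIdx.x] = thread_max;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }
    const float row_max = red[0];
    __syncthreads();

    float thread_sum = 0.0f;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        local[i] = (local[i] == -FLT_MAX) ? 0.0f : __expf(local[i] - row_max);
        thread_sum += local[i];
    }
    red[threadIdx.x] = thread_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }
    const float inv_sum = 1.0f / (red[0] + 1e-6f);

    // Normalize and write back.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        const int col = threadIdx.x + i * blockDim.x;
        if (col < k_length) row[col] = local[i] * inv_sum;
    }
}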

Taking back what I just said: it seems there is no problem with this change, at least when compared against HuggingFace on a context sample close to 6K. Awaiting further testing.

StarrickLiu avatar Aug 28 '23 12:08 StarrickLiu

Thanks @lkm2835 and @StarWorkXc.

zhen-jia avatar Aug 28 '23 16:08 zhen-jia

Hi @StarWorkXc, I also tested with prompt lengths >7K, with similar modifications in my repo. The results are reasonable. I am thinking of submitting a PR for that. What do you think?

zhen-jia avatar Sep 01 '23 21:09 zhen-jia

I think that's fine; go ahead and submit it. I'm still trying to get it to support 16K, but it's rather tricky.

StarrickLiu avatar Sep 02 '23 13:09 StarrickLiu

Sent a PR. Could someone help review it?

zhen-jia avatar Sep 04 '23 03:09 zhen-jia

@StarWorkXc Is there a good way to extend it to 16K or higher?

zhen-jia avatar Sep 27 '23 23:09 zhen-jia

The limitation is mainly in src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp. If your GPU architecture is new enough, adjust the kernel launch parameters so the kernel can use the maximum shared memory the architecture supports. If the GPU is older and shared memory is capped at 48 KB, try modifications in the following two directions:

  1. Following Flash Attention, rewrite the core logic of decoder_masked_multihead_attention_template.
  2. Change the data type used to store the attention scores in shared memory from fp32 to fp16.

The second option is simpler; my modified version currently supports up to a 32K context on GPUs with 96 KB of shared memory.

StarrickLiu avatar Oct 07 '23 07:10 StarrickLiu
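To make the first direction concrete, here is a minimal host-side sketch, not FasterTransformer's actual code path: on Volta and newer GPUs a kernel can opt in to more than the default 48 KB of dynamic shared memory via cudaFuncSetAttribute. The kernel name mmha_kernel_sketch and its empty body are placeholders; only the attribute/launch pattern is the point.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, not the real decoder_masked_multihead_attention kernel.
// The attention scores for one (batch, head) pair would live in the dynamic
// shared memory buffer; storing them as fp16 instead of fp32 (direction 2
// above) halves the footprint and roughly doubles the supported context.
__global__ void mmha_kernel_sketch(const float* qk, float* out, int k_length)
{
    extern __shared__ char smem[];
    float* logits = reinterpret_cast<float*>(smem);
    (void)logits; (void)qk; (void)out; (void)k_length;
}

int main()
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    // Opt-in limit per block, e.g. ~96 KB on Volta, larger on newer parts.
    printf("opt-in shared memory per block: %zu bytes\n", prop.sharedMemPerBlockOptin);

    // Raise the kernel's dynamic shared memory ceiling above the 48 KB default
    // (only effective on architectures that expose more shared memory).
    cudaFuncSetAttribute(mmha_kernel_sketch,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(prop.sharedMemPerBlockOptin));

    const int    k_length = 16384;                    // target context length
    const size_t smem_sz  = k_length * sizeof(float); // 64 KB of fp32 scores; fp16 would halve this
    mmha_kernel_sketch<<<1, 256, smem_sz>>>(nullptr, nullptr, k_length);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

The Flash Attention direction avoids keeping the full score row in shared memory at all, which is why it scales further but requires rewriting the core loop of the kernel.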

@StarWorkXc That's a good approach. Would you mind sharing the modified branch?

zhen-jia avatar Oct 09 '23 06:10 zhen-jia

TensorRT-LLM is released and FasterTransformer development has transitioned to TensorRT-LLM.

TensorRT-LLM also supports very long sequence lengths; please try that.

byshiue avatar Oct 20 '23 07:10 byshiue

@byshiue Did you mean that FasterTransformer will be replaced by TensorRT-LLM and fastertransformer_backend will be replaced by tensorrtllm_backend?

rawk-v avatar Oct 20 '23 07:10 rawk-v

Yes.

byshiue avatar Oct 20 '23 07:10 byshiue

Thanks @byshiue, I will try TRT-LLM.

zhen-jia avatar Oct 20 '23 22:10 zhen-jia