[Long seq length] GPT seq length constraint
It seems that the maximum seq length supported for GPT is 4096:
(https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/models/multi_gpu_gpt/ParallelGptDecoderLayerWeight.h#L31)
BERT also seems to have the same maximum seq length.
May I ask the following questions:
- Where does the constraint come from, i.e., which kernel?
- Do you have a plan to support longer seq lengths?
Thanks!
Hi, for the first question, there seems to be a constraint in the softmax kernel.
https://github.com/NVIDIA/FasterTransformer/issues/663
Hello, I see it. I'm trying to modify it in order to remove this constraint, but it seems to have precision problems. Does anyone know the specific reason for this constraint?
template<typename T, typename T_IN>
void invokeMaskedSoftmax(MaskedSoftmaxParam<T, T_IN>& param, cudaStream_t stream)
{
    // attention_score, (batch_size, head_num, q_length, k_length), softmax output.
    // qk, (batch_size, head_num, q_length, k_length), QK^T.
    // attention_mask, (batch_size, q_length, k_length), attention mask.
    // linear_bias_slopes, (head_num,) the slopes of the linear position bias.
    dim3 grid(param.q_length, param.batch_size, param.num_heads);
    if (param.batch_size * param.num_heads > 360) {
        grid.x = ceil(float(param.q_length) / 32.0f);
    }

    bool is_half2 = sizeof(T) == 2 && sizeof(T_IN) == 2 && param.k_length % 2 == 0;
    dim3 block((param.k_length / (is_half2 ? 2 : 1) + 31) / 32 * 32);

    // Added modification
    if (block.x > 4096 && block.x <= 8192) {
        LAUNCH_MAKSED_SOFTMAX(8)
    }
    else if (block.x > 2048 && block.x <= 4096) {
        LAUNCH_MAKSED_SOFTMAX(4)
    }
    else if (block.x > 1024) {
        LAUNCH_MAKSED_SOFTMAX(2)
    }
    else if (block.x > 0) {
        LAUNCH_MAKSED_SOFTMAX(1)
    }
    else {
        FT_CHECK(param.k_length <= 4096);
    }
}
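For background on where the 4096 cap comes from: as far as I can tell, LAUNCH_MAKSED_SOFTMAX(ITEMS_PER_THREAD) divides block.x by ITEMS_PER_THREAD and launches a softmax kernel in which each thread covers ITEMS_PER_THREAD elements of a k_length row. Since CUDA limits a block to 1024 threads, the original dispatch with at most 4 items per thread tops out at k_length = 4096 on the fp32 path (half2 packing doubles the reach), and the added 8-items-per-thread branch extends it to 8192. Below is a minimal, standalone sketch of this items-per-thread row softmax pattern; it is not FasterTransformer's actual kernel, the names and indexing are illustrative only, and it assumes a power-of-two block size.

#include <cuda_runtime.h>
#include <cfloat>
#include <cmath>
#include <cstdio>

// Sketch: one block per row, each thread covering ITEMS_PER_THREAD elements.
template<int ITEMS_PER_THREAD>
__global__ void rowSoftmaxSketch(float* data, int row_len)
{
    extern __shared__ float red[];               // one slot per thread for reductions
    float* row = data + (size_t)blockIdx.x * row_len;

    // 1. Per-thread max over its ITEMS_PER_THREAD elements of the row.
    float local_max = -FLT_MAX;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            local_max = fmaxf(local_max, row[idx]);
        }
    }
    red[threadIdx.x] = local_max;
    __syncthreads();
    // Block-wide max reduction (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        }
        __syncthreads();
    }
    float row_max = red[0];
    __syncthreads();

    // 2. Per-thread sum of exp(x - max), then block-wide sum reduction.
    float local_sum = 0.0f;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            local_sum += expf(row[idx] - row_max);
        }
    }
    red[threadIdx.x] = local_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            red[threadIdx.x] += red[threadIdx.x + s];
        }
        __syncthreads();
    }
    float row_sum = red[0];
    __syncthreads();

    // 3. Normalize in place.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            row[idx] = expf(row[idx] - row_max) / row_sum;
        }
    }
}

int main()
{
    const int rows = 2, row_len = 8192;          // 8192 > 4096, so 4 items/thread is not enough
    const int items = 8, threads = 1024;         // 1024 threads * 8 items covers the row
    float* d = nullptr;
    cudaMalloc(&d, (size_t)rows * row_len * sizeof(float));
    cudaMemset(d, 0, (size_t)rows * row_len * sizeof(float));
    rowSoftmaxSketch<items><<<rows, threads, threads * sizeof(float)>>>(d, row_len);
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}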
I take back what I just said: there seems to be no problem with this change, at least when compared against HuggingFace on a context sample close to 6K. Awaiting further testing.
Thanks @lkm2835 and @StarWorkXc.
Hi @StarWorkXc, I also tested with prompt lengths >7K, with similar modifications in my repo. The results are reasonable. I am thinking of submitting a PR for that. What do you think?
I think that's fine, go ahead and submit it. I'm still trying to make it support 16k, but it's rather troublesome.
Sent a PR. Could someone help review it?
I think that's fine, go ahead and submit it. I'm still trying to make it support 16k, but it's rather troublesome.
@StarWorkXc Is there a good way to extend it to 16k or higher?
The limitation is mainly in src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp. If your GPU architecture is recent enough, modify the kernel parameters so that the kernel can use the maximum shared memory of the current architecture. If the GPU is older and shared memory is capped at 48 KB, try modifications in two directions:
1. Following FlashAttention, modify the core logic of decoder_masked_multihead_attention_template.
2. Change the data type of the attention scores stored in shared memory from fp32 to fp16.
The second point is fairly simple; my modified version currently supports contexts up to 32K on GPUs with 96 KB of shared memory.
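To make the first direction concrete, here is a minimal sketch of opting a kernel into more than the default 48 KB of dynamic shared memory on Volta-or-newer GPUs via cudaFuncSetAttribute (this is the standard CUDA mechanism; the kernel name myMmhaKernel is a placeholder rather than FasterTransformer's actual symbol, and the kernel body is elided).

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myMmhaKernel(const float* in, float* out, int len)
{
    // In the real kernel, the per-query attention scores would live in dynamic
    // shared memory declared like this; the attention logic is elided in this sketch.
    extern __shared__ float smem[];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Maximum dynamic shared memory per block available after opt-in (bytes),
    // e.g. roughly 96 KB on V100 and 163 KB on A100.
    size_t max_smem = prop.sharedMemPerBlockOptin;
    printf("opt-in shared memory per block: %zu bytes\n", max_smem);

    // Without this call, requesting more than 48 KB of dynamic shared memory
    // at launch fails with an invalid-argument error.
    cudaFuncSetAttribute(myMmhaKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)max_smem);

    // Launch with the enlarged dynamic shared-memory size.
    myMmhaKernel<<<1, 256, max_smem>>>(nullptr, nullptr, 0);
    cudaDeviceSynchronize();
    return 0;
}

How much context this buys depends on how many bytes the kernel stores per key position, which is why switching the score buffer from fp32 to fp16 (the second direction above) roughly doubles the reachable context for the same shared-memory budget.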
@StarWorkXc That's a good approach. Would you mind sharing the modified branch?
TensorRT-LLM is released and FasterTransformer development has transitioned to TensorRT-LLM.
TensorRT-LLM also supports very long sequence lengths; please try it.
@byshiue Do you mean that FasterTransformer will be replaced by TensorRT-LLM and fastertransformer_backend will be replaced by tensorrtllm_backend?
Yes.
Thanks @byshiue, I will try TRT-LLM