feat: Support cos_sin_cache in all cases.
This PR contains the following updates:
- Handle `fuse_pos_embd=True/False` and create `RotaryEmbedding` inside the attention module, so that users don't need to handle it in the modeling files.
- Cache `cos_sin` for the unfused rope implementation. If flashinfer is available, use `apply_rope_with_cos_sin_cache_inplace` instead of `apply_rope_inplace`. Otherwise, we fall back to a pure PyTorch implementation, which can now support any rope type.
- We use `create_rope_const_params` to create and cache `cos_sin_cache` for all rope types, including Deepseek yarn rope.
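To illustrate the caching idea behind the second and third items, here is a minimal numpy sketch: the cos/sin tables are precomputed once for all positions (analogous to what `create_rope_const_params` caches), and the unfused rope application then becomes a cheap lookup plus rotation. The helper names and the neox-style rotate-half layout are assumptions for illustration, not the actual TensorRT-LLM implementation.

```python
import numpy as np

def create_cos_sin_cache(head_dim, max_pos, base=10000.0):
    # Hypothetical helper mirroring the role of create_rope_const_params:
    # precompute cos/sin for every position once, then reuse across layers.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    freqs = np.outer(np.arange(max_pos), inv_freq)  # [max_pos, head_dim // 2]
    return np.cos(freqs), np.sin(freqs)

def apply_rope_unfused(x, pos, cos, sin):
    # Unfused rope on one head vector, neox-style rotate-half formulation:
    # the cached cos/sin rows for `pos` rotate each (x1, x2) pair in place.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    c, s = cos[pos], sin[pos]
    return np.concatenate([x1 * c - x2 * s, x2 * c + x1 * s], axis=-1)
```

With flashinfer available, the same cached table would instead be handed to `apply_rope_with_cos_sin_cache_inplace`, so both paths share one `cos_sin_cache`.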
/bot run --add-multi-gpu-test
PR_Github #283 [ run ] triggered by Bot
PR_Github #283 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #272 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #387 [ run ] triggered by Bot
PR_Github #387 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #345 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #430 [ run ] triggered by Bot
PR_Github #430 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #369 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #510 [ run ] triggered by Bot
I think I was pinged by mistake; is the review request actually intended for @litaotju?
PR_Github #510 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #437 completed with status: 'FAILURE'
@yuxianq Can we split this PR to several small PRs? For example, the first item can be a single PR.
> - Handle `fuse_pos_embd=True/False` and create `RotaryEmbedding` inside the attention module, so that users don't need to handle it in the modeling files.

> Can we split this PR to several small PRs? For example, the first item can be a single PR.
@QiJune I will give it a try. Let me pass the CI first to validate that these features work correctly.
/bot run --add-multi-gpu-test
PR_Github #550 [ run ] triggered by Bot
PR_Github #550 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #469 completed with status: 'FAILURE'
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #584 [ run ] triggered by Bot
PR_Github #584 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #497 completed with status: 'FAILURE'
/bot run --disable-fail-fast --stage-list "A30-7"
/bot run --disable-fail-fast --stage-list "A30-7"
PR_Github #1005 [ run ] triggered by Bot
PR_Github #1005 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #776 (Partly Tested) completed with status: 'FAILURE'
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #1030 [ run ] triggered by Bot
PR_Github #1030 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #794 completed with status: 'FAILURE'
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #1088 [ run ] triggered by Bot