thd format is not supported with hierarchical CP implementation yet
Is your feature request related to a problem? Please describe.
Ulysses SP + ring attention gives good performance in SFT/RL training; this combination is called hierarchical CP here. However, it does not currently support qkv_format 'thd' for sequence packing, and packing sequences is another way to gain good throughput.
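For reference, here is a rough sketch of the kind of setup that hits the assertion in the traceback below. This is only an illustration: in my run the path is reached through Megatron-LM, and the argument names here (set_context_parallel_group, cp_comm_type, the model shape, etc.) reflect my reading of the TE API rather than the actual call site.

```python
# Illustrative sketch only (launch under torchrun); not the actual Megatron-LM call path.
import torch
import torch.distributed as dist
from transformer_engine.pytorch import DotProductAttention

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

cp_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # context-parallel group
cp_global_ranks = dist.get_process_group_ranks(cp_group)
cp_stream = torch.cuda.Stream()

attn = DotProductAttention(
    num_attention_heads=32,          # hypothetical model shape
    kv_channels=128,
    attn_mask_type="padding_causal",
    qkv_format="thd",                # packed sequences
)

# Hierarchical CP = Ulysses (a2a) + ring attention (p2p)
attn.set_context_parallel_group(
    cp_group, cp_global_ranks, cp_stream,
    cp_comm_type="a2a+p2p",          # this combination rejects qkv_format="thd"
)

# A forward pass with packed 'thd' tensors (q, k, v plus cu_seqlens_q / cu_seqlens_kv)
# then raises the AssertionError shown in the traceback below.
```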
```
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 659, in forward
[rank0]:     output = attn_forward_func_with_cp(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 3619, in attn_forward_func_with_cp
[rank0]:     out = AttnFuncWithCPAndKVP2P.apply(*args)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 469, in forward
[rank0]:     qkv_format != "thd"
[rank0]: AssertionError: thd format is not supported with hierarchical CP implementation yet!
```
Platform: H800, PyTorch 2.7, Megatron-LM branch core_r0.13.0, Transformer Engine 2.4.0
Describe the solution you'd like
I don't have a clear solution in mind for now.
Describe alternatives you've considered
Disabling packing might avoid the error, but it would affect loss convergence.
Additional context
Add any other context or screenshots about the feature request here.
@cyanguwa @xrennvidia Any comments on this?
Yeah, it is not supported yet. Hierarchical CP uses the a2a+p2p comm type, but THD+CP does not work with a2a yet, so hierarchical CP cannot work with the THD format. We will discuss the support internally; thanks for bringing this up to us.
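In the meantime, the combinations that do work with packed sequences, as I understand this thread, look roughly like the sketch below (reusing the hypothetical names from the snippet in the issue description; not an authoritative compatibility matrix):

```python
# Sketch of cp_comm_type choices vs. qkv_format="thd", based on this thread.
attn.set_context_parallel_group(
    cp_group, cp_global_ranks, cp_stream,
    cp_comm_type="p2p",       # ring attention only: works with qkv_format="thd" today
)
# cp_comm_type="a2a"          # Ulysses only: THD support not available yet either
# cp_comm_type="a2a+p2p"      # hierarchical CP: THD not supported (this issue)
```

So a possible stopgap is to keep packing and drop back to plain p2p ring attention, at the cost of losing the Ulysses half of hierarchical CP.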
@xrennvidia thanks for your reply. May I ask another question: which of the a2a and p2p communication methods is faster? Ulysses SP is commonly considered faster than ring attention, but the results from the test file tests/pytorch/attention/test_attention_with_cp.py suggest otherwise.
```
# a2a
[INFO] rank:1 CP attn time cost: 2.8221261501312256 seconds
# p2p
[INFO] rank:1 CP attn time cost: 2.388253927230835 seconds
```
@stormchasingg the unit test is only a functionality check, not a performance benchmark.
I would not say Ulysses (i.e., a2a) is always faster than p2p. If you have long sequences, P2P comm should be fully overlapped, while A2A is always exposed, so P2P should be better for long sequences, such as 128K length with CP4. However, if your sequence length is relatively short but you want to scale out with a big CP size, such as 16K length with CP8, then I guess a2a is probably better, because its communication volume is smaller and P2P cannot be fully overlapped.
Hence, the best comm type is case by case; there is no single answer. We provide all the implementations so that it's more convenient for users to choose the best one for their use cases.
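If it helps, a fairer way to compare the two on a specific workload than the unit test timings is something like the rough harness below (a sketch only; `attn` and the packed inputs are assumed to be set up already, as in the earlier snippets):

```python
# Rough per-rank timing sketch using CUDA events with warmup iterations.
import torch

def time_attn(attn, q, k, v, iters=20, warmup=5, **kwargs):
    # Warmup so lazy initialization and kernel autotuning don't skew the numbers.
    for _ in range(warmup):
        attn(q, k, v, **kwargs)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        attn(q, k, v, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per forward
```

Running the same harness once with cp_comm_type="p2p" and once with "a2a" (or "a2a+p2p") at your real sequence length and CP size should show which side of the tradeoff above you are on.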
Hi @xrennvidia - thank you for the prompt response to @stormchasingg. Would you please provide an estimate of when the THD format is expected to be supported, so we can tag it with the appropriate release version and show it in the roadmap? We want to open up visibility into our roadmap to the community whenever possible.
@sudhakarsingh27 has a PR which enables THD+CP+A2A (refer here). I do not know when this PR can be merged. Later on, we might be able to extend the support to A2A+P2P.
I will defer this to @sudhakarsingh27 and @cyanguwa. It would be better to assign this to them.