Baole Ai
This is an impressive project! It demonstrates superior performance compared to Ulysses. Have you also compared FastSeq with RingAttention or Context Parallel in TransformerEngine?
## 🚀 Feature
Support inputs with dynamic shapes.
## Motivation
During the training of Large Language Models (LLMs), the sequence lengths of the input data are typically variable, necessitating padding prior...
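For context, a minimal sketch of the padding the motivation refers to, assuming a PyTorch token pipeline; `pad_sequence` is the standard utility, and the sequence lengths here are illustrative:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of different lengths, as is typical for LLM training data.
seqs = [torch.randint(0, 1000, (n,)) for n in (5, 9, 3)]

# Without dynamic-shape support, every sequence must be padded to the
# longest length in the batch before they can be stacked into one tensor.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([3, 9])
```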
Hi, there is no "Memory" view in TensorBoard after setting `profile_memory=True`. The test environment is PyTorch 1.8.1 + torch_tb_profiler 0.4.0.
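For reproduction, here is a minimal sketch of the setup being described, assuming the `torch.profiler` API introduced in PyTorch 1.8; the model, input, and log directory are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = torch.nn.Linear(128, 128)
inputs = torch.randn(32, 128)

# profile_memory=True should add a "Memory" view to the TensorBoard plugin.
with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    on_trace_ready=tensorboard_trace_handler("./log"),
) as prof:
    model(inputs)
```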
It seems like benchmark_pipe only supports the [shm|uv] transports and the [basic] channel. Does benchmark_pipe also support the ibv transport and the cuda channel? Is there a complete example of different transport and channel...
Hi, when I built as follows
```
$ cd tensorpipe
$ mkdir build
$ cd build
$ cmake ../ -GNinja -DTP_BUILD_TESTING=ON
$ ninja
```
I got the following errors: FAILED:...
Hi Team, `_flash_attn_forward` now supports fp8, but `_flash_attn_varlen_forward` does not support fp8 yet (https://github.com/Dao-AILab/flash-attention/blob/main/hopper/flash_api.cpp#L440). I would like to ask if there are any plans to implement support for `_flash_attn_varlen_forward` using...
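For reference, a sketch of the varlen path in question via the public `flash_attn_varlen_func` wrapper; the shapes here are illustrative, and the fp16 dtype marks exactly what fp8 would replace on this path:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two sequences of lengths 5 and 3 packed into one unpadded tensor.
cu_seqlens = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
q = torch.randn(8, 4, 64, dtype=torch.float16, device="cuda")  # (total, nheads, headdim)
k, v = torch.randn_like(q), torch.randn_like(q)

# The varlen forward currently accepts fp16/bf16; accepting fp8 inputs
# on this path is what the question asks about.
out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, 5, 5, causal=True)
```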
Hi @Aaryan0404, while reviewing the implementation of the attention kernel in h100.cu, I noticed the following scaling of norm_vec based on the dimension D:
```
if constexpr (D == 64)...
```
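For context, and purely as an assumption about what the D-dependent constant encodes: attention kernels conventionally scale the logits by $1/\sqrt{D}$ before the softmax, so the factor naturally differs between head dimensions:

$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right)V,\qquad \frac{1}{\sqrt{64}}=0.125,\quad \frac{1}{\sqrt{128}}\approx 0.0884.
$$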