Pearl Li
**Tldr: Add implementations of [FlashAttention](https://arxiv.org/abs/2205.14135) using OpenAI's Triton language.**

**Background**:

- FlashAttention: an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory and on-chip SRAM, yielding a 15% end-to-end speedup...
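For intuition, here is a minimal NumPy sketch of the tiling / online-softmax idea that FlashAttention builds on. It is a plain reference illustration of the math, not the Triton kernel; the `block` size and variable names are illustrative assumptions.

```python
# Reference sketch of tiled attention with online softmax (illustrative only;
# the real FlashAttention kernel fuses this into on-chip GPU computation).
import numpy as np

def tiled_attention(Q, K, V, block=64):
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block):                 # one tile of query rows
        q = Q[i:i + block]
        m = np.full(q.shape[0], -np.inf)         # running row-wise max
        l = np.zeros(q.shape[0])                 # running softmax denominator
        acc = np.zeros((q.shape[0], d))          # unnormalized output accumulator
        for j in range(0, N, block):             # stream over key/value tiles
            s = q @ K[j:j + block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)            # rescale older statistics
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[j:j + block]
            m = m_new
        O[i:i + block] = acc / l[:, None]        # normalize once per query tile
    return O
```

Because each key/value tile is visited once and only running statistics (`m`, `l`, `acc`) are kept, the full N x N score matrix is never materialized in slow memory, which is the source of the IO savings.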
**Patch description**

1. Moved all logic inside [._check_final_chat_data()](https://github.com/facebookresearch/ParlAI/blob/989e29ff8d7a9606e2bbc7db7290b58fe9b49017/parlai/crowdsourcing/tasks/model_chat/utils.py#L398) and [._check_output_key()](https://github.com/facebookresearch/ParlAI/blob/989e29ff8d7a9606e2bbc7db7290b58fe9b49017/parlai/crowdsourcing/tasks/model_chat/utils.py#L385) in the class `AbstractModelChatTest` into [._remove_non_deterministic_keys()](https://github.com/facebookresearch/ParlAI/blob/989e29ff8d7a9606e2bbc7db7290b58fe9b49017/parlai/crowdsourcing/tasks/model_chat/utils.py#L338).
2. Changed 4 unit tests, `test_model_chat.py`, `test_model_image_chat.py`, `test_demo_chat.py`, and `test_qa_data_collection.py`, to use pytest regressions,...
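For readers unfamiliar with pytest regressions: the plugin's `data_regression` fixture serializes a dict to a file on the first run and diffs against that file on later runs. A hypothetical sketch of the pattern (the test name and data below are illustrative, not the actual ParlAI tests):

```python
# Hypothetical example of the pytest-regressions pattern; requires the
# pytest-regressions plugin, which provides the data_regression fixture.
def test_final_chat_data(data_regression):
    # Stand-in for chat data after non-deterministic keys are removed.
    chat_data = {"dialog": [{"text": "hi", "agent_idx": 0}]}
    # First run writes a .yml file next to the test; later runs diff against it.
    data_regression.check(chat_data)
```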
Add a tutorial on how to add a custom CUDA C++ kernel to ParlAI
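Since ParlAI is built on PyTorch, one plausible route for such a tutorial is PyTorch's JIT C++/CUDA extension loader. A minimal sketch, assuming a working CUDA toolchain; the kernel, extension, and function names are hypothetical:

```python
# Hypothetical example: JIT-compiling a tiny CUDA kernel from Python with
# torch.utils.cpp_extension.load_inline (needs nvcc available at runtime).
import torch
from torch.utils.cpp_extension import load_inline

cpp_src = "torch::Tensor scale(torch::Tensor x, float a);"  # binding declaration

cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

torch::Tensor scale(torch::Tensor x, float a) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(), a, n);
    return y;
}
"""

ext = load_inline(name="scale_ext", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["scale"])

x = torch.ones(1024, device="cuda")
print(ext.scale(x, 2.0)[:4])  # -> tensor([2., 2., 2., 2.], device='cuda:0')
```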
I am trying to debug my fused attention code and would like to print some intermediate values, such as the computed pointers as well as the values loaded from them.
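One option, assuming a Triton release recent enough to have `tl.device_print`, is to print from inside the kernel; a minimal sketch with an illustrative copy kernel rather than the fused attention code:

```python
# Illustrative kernel showing device-side printing in Triton (assumes a
# version of Triton that provides tl.device_print).
import torch
import triton
import triton.language as tl

@triton.jit
def debug_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    tl.device_print("offsets: ", offs)    # the pointer offsets being used
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.device_print("loaded: ", x)        # the values loaded from the pointers
    tl.store(out_ptr + offs, x, mask=mask)

x = torch.arange(8, device="cuda", dtype=torch.float32)
out = torch.empty_like(x)
debug_kernel[(1,)](x, out, x.numel(), BLOCK=8)
torch.cuda.synchronize()  # make sure device-side prints are flushed
```

Recent Triton versions also ship an interpreter mode (set the `TRITON_INTERPRET=1` environment variable) that runs kernels on the CPU, where ordinary `print` and Python debuggers work.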
I am curious why the fused attention code only works with the A100. Is there a way to make it work on other GPUs, such as the Quadro GP100?