Add SDPA and FlashAttention support to T5
I made some changes to the T5 modeling file to support the new attention interface. I also rearranged things a bit to fold position_bias into the attention mask correctly.
Fixes #26350
One note: I ran `make fix-copies`, but it broke several related models such as longt5 and mt5. Somehow the fix script didn't copy over the imports and couldn't pick up the attention code correctly, so I skipped that part. If that's acceptable, we can merge this PR and I'll work on the related models in a follow-up PR, or I'm happy to take some hints on making the script work properly.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
@ArthurZucker @Cyrilvallez @vasqu
Sorry to be so strict about this, but T5 is not a good candidate for FlashAttention / SDPA. The reason is that the relative attention bias has to be modeled there, and as of now that's not possible with base FlashAttention (it might be possible with SDPA, but that needs proper mask preparation). tl;dr: it will only support eager attention in the end.
We can still refactor this to have the attention interface-like implementation, but only for eager in the end (i.e. `_supports_sdpa`/`_supports_flash_attn` remain `False`). Wdyt?
Sounds reasonable to me!
Hey again @vasqu, I made the changes to restrict to eager attention only. Model tests are passing; only the repo consistency checks fail, as I mentioned above. The PR is ready for merge 😊
Hey @vasqu, thanks for your detailed review and suggestions. I made the changes, please have another look 😊 I also ran several rounds of T5ForConditionalGeneration.generate on CPU and GPU with t5-small and t5-base to double-check the functionality. I examined the encoder outputs separately again to verify the attention implementation, and everything looks good.
run-slow: t5
This comment contains run-slow, running the specified jobs:
models: ["models/t5"] quantizations: []
@DuyguA I've refactored it myself because it involves quite a few things, and I also had to backpedal a bit on what I said before. Now everything works for T5 (and it supports SDPA). However, we now need to fix the other broken tests that relied on T5's code, either by copying from it or by using it in some other way.
I'll leave it here for now. It would be nice if you could continue from this point, or I can pick it up at some other time. It should at least provide a good basis.
Great, thanks @vasqu. I'll take it from here and hope to finish in a couple of days.
[For maintainers] Suggested jobs to run (before merge)
run-slow: mt5, t5
View the CircleCI Test Summary for this PR:
https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42453&sha=405a57