Add Mega: Moving Average Equipped Gated Attention
What does this PR do?
Fixes #19982
This pull request adds Mega: Moving Average Equipped Gated Attention, which currently leads the LRA benchmark. The implementation is adapted from the original fairseq-based repo, and I validated it with an MLM checkpoint I trained using the original implementation on the wikitext-103 dataset. Since there is no proposed Mega tokenizer, I used the RoBERTa tokenizer for the wikitext checkpoint. The proposed implementation works in both encoder and decoder settings, and all relevant tests are passing.
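For reviewers who want a quick smoke test, here is a minimal usage sketch. The checkpoint path and class name below are assumptions following the library's usual naming conventions; the final names may differ once this is merged:

```python
from transformers import AutoTokenizer, MegaForMaskedLM  # class name assumed from the usual ModelForTask convention

# hypothetical hub path for the wikitext-103 MLM checkpoint described above
checkpoint = "mnaylor/mega-base-wikitext"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # resolves to the RoBERTa tokenizer used for pretraining
model = MegaForMaskedLM.from_pretrained(checkpoint)

inputs = tokenizer("The Eiffel Tower is located in <mask>.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, vocab_size)
```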
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [X] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@ArthurZucker and @younesbelkada for text models; tagging @NielsRogge for visibility as he responded to the original issue.
Sorry for the initial test failures! They should be taken care of now. Also, I want to point out that I did not have access to a GPU while developing this, so I was not able to test on one.
Hi @mnaylor5, thanks for your great work on this! Let us know when you think this is ready for review 💪
Thank you @younesbelkada! It is ready for review now 😄
Hi @younesbelkada / @ArthurZucker - just checking in to see if there is anything you need from me before reviewing this pull request. Looking forward to being able to use Mega in transformers!
Hey! I'll give you a review tomorrow! Sorry for the wait, I had to sync with @younesbelkada on this one.
Thanks @ArthurZucker, and no worries! 😄
Thanks for the review @ArthurZucker! I'll reply to individual comments where I can clear things up, and I'll accept your suggestions wherever I can. I'll probably be able to start on the modifications later today, and if not, then early next week.
Alright @ArthurZucker, this should be good to review again! The biggest updates in this version are removing the `reset_parameters` methods in favor of `_init_weights`, renaming variables and comments to avoid single-letter names, updating the docstring format, and renaming Mega to MEGA based on your suggestion. I have resolved the comments where I made the changes and left the other comments open for continued discussion.
Thanks again for your feedback, and I'm happy to answer any questions that arise. Looking forward to getting MEGA into the library! 🚀
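As a quick note on the `_init_weights` change for anyone following along: transformers centralizes weight initialization on the pretrained base class rather than in per-module `reset_parameters` methods. A rough sketch of the pattern (module branches abbreviated, not the exact Mega code):

```python
import torch.nn as nn
from transformers import PreTrainedModel


class MegaPreTrainedModel(PreTrainedModel):
    """Sketch of the library-wide pattern: all weight init lives in one method."""

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
```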
Hi there @ArthurZucker - thanks again for the feedback in your previous review. Just reaching out to see if anything else is needed from me before the next review and, hopefully, a merge!
Hey! Sorry I must have missed your previous ping! Will review now!
Thanks @ArthurZucker! I appreciate the quick review and the encouragement 😄 I added a couple of questions where things weren't totally clear to me, but I can get started on everything else now. I'm really excited about getting this model into the library, and hopefully there won't be too many more changes required!
Will answer your questions tomorrow!
Alright @ArthurZucker, I think that's everything except the threads with ongoing discussion. I'm super happy with how this is shaping up! In the latest batch of commits:
- Renamed classes, variables, and params based on comments (mainly in EMA and MovingAverageGatedAttention class)
- Rearranged positional bias, normalization functions, activation functions, dropout classes
- Added the `copied from` comments where requested
- Added token type ID buffer
- Added tests for generation and sequence classification
- Moved FFT convolution into a reusable method with additional documentation (see the sketch after this list)
- Addressed merge conflicts from LLaMA 🦙
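For context on the FFT refactor mentioned above: Mega's EMA sublayer amounts to a long convolution over the sequence, which is much cheaper to compute in the frequency domain. A minimal sketch of the general technique (the helper name and padding choice are illustrative, not the exact method in this PR):

```python
import torch


def fft_convolution(inputs, kernel, length):
    """Convolve along the last (sequence) dimension via the FFT.

    Zero-padding to 2 * length makes the circular FFT convolution behave like
    a linear convolution; the first `length` outputs are the causal part.
    """
    inputs_fft = torch.fft.rfft(inputs.float(), n=2 * length)
    kernel_fft = torch.fft.rfft(kernel.float(), n=2 * length)
    convolved = torch.fft.irfft(inputs_fft * kernel_fft, n=2 * length)
    return convolved[..., :length]
```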
Thanks for the feedback and I'll wait on any more changes until you get a chance to review the updates and resolve the open discussions. Excited to get up and running with MEGA in transformers 🚀 🤗
@ArthurZucker as an update, it looks like the fix for left-padding is going to be a more significant effort to implement -- the relative bias is applied in the attention function, and it expects all of the inputs to be left-to-right starting at position 0. We can probably refactor to accept the position IDs like they did for CodeGen, but we'll also need to change how the bias is added since it is currently using a single (seq_len, seq_len) tensor for the entire batch. Refactoring that might be the heavier lift, but I'm still exploring.
I'll dig more into this tomorrow, but for the meantime, I've pushed updates that address the rest of your comments! If you have any other suggestions on the fix for relative positions, I'd love to hear them! 😄
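To make the kind of refactor I'm describing concrete, here's a rough sketch of a per-row gather over a precomputed bias table (hypothetical helper, not code in this PR):

```python
import torch


def gather_relative_bias(bias, position_ids):
    """Sketch: align a (max_len, max_len) bias table with left-padded rows.

    bias: table where bias[i, j] is the positional bias between positions i and j
    position_ids: (batch, seq_len) true positions, e.g. starting after the pad tokens
    returns: (batch, seq_len, seq_len) bias gathered per batch row
    """
    rows = bias[position_ids]  # (batch, seq_len, max_len): rows for each query position
    cols = position_ids.unsqueeze(1).expand(-1, position_ids.size(1), -1)
    # result[b, i, j] == bias[position_ids[b, i], position_ids[b, j]]
    return torch.gather(rows, 2, cols)
```

With right padding, every row's positions are simply 0..seq_len-1, so the single shared (seq_len, seq_len) tensor works; the gather only becomes necessary once rows can start at different offsets.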
Sure! Also it's not that important to have left padding in this PR, can be added in another PR!
Thanks @ArthurZucker! After digging into it, I do think it will require a pretty significant refactor to support left-padding in this PR. If you're comfortable with it, I agree that it could make sense in a new PR. I just added an entry in the `MegaBlock` docstring for the new `causal_mask` coming from the pretrained model's method, and added a missing device for the token type IDs.
Also pulled the latest changes from main to hopefully fix whatever was causing the tests for exotic models to fail. I'm really happy with how this is looking, so let me know if there's anything else needed to move forward with this PR! Appreciate your comments and guidance on everything so far! 🚀
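For reference, the token type ID device fix follows the usual buffered-fallback pattern in the library; sketched from memory here with illustrative names, not the exact diff:

```python
import torch


def default_token_type_ids(embeddings, input_ids):
    """Sketch: build token type IDs on the same device as the inputs.

    Prefers a registered buffer (already on the model's device) and otherwise
    creates zeros explicitly on input_ids.device, which is the missing piece
    the fix above adds.
    """
    batch_size, sequence_length = input_ids.shape
    if hasattr(embeddings, "token_type_ids"):
        buffered = embeddings.token_type_ids[:, :sequence_length]
        return buffered.expand(batch_size, sequence_length)
    return torch.zeros(
        (batch_size, sequence_length), dtype=torch.long, device=input_ids.device
    )
```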
Awesome, it's alright with me to leave this to another PR. Will do my final review before pinging @sgugger for another pair of eyes!
Thanks again @ArthurZucker and @sgugger! Appreciate the feedback, and it should all be addressed in the latest changes 🤗
Great working with you @mnaylor5! Congrats again on the merge 🔥
Congrats @mnaylor5! Feel free to share on social media and we'll amplify your post
Thanks so much @ArthurZucker and @NielsRogge! I learned a ton through this process, and it's so rewarding to see my code in a library I use so much ❤️
I posted something here on LinkedIn a couple days ago - I'll tag you guys in the comments as well! https://www.linkedin.com/posts/mitchnaylor_mega-activity-7045103140890660864-9VOU