Add Mega: Moving Average Equipped Gated Attention
What does this PR do?
Fixes #19982
This pull request adds Mega: Moving Average Equipped Gated Attention, which currently leads the LRA benchmark. The implementation is adapted from the original fairseq-based repo, and I validated it with an MLM checkpoint I trained using the original implementation on the wikitext-103 dataset. Since there is no proposed Mega tokenizer, I used the RoBERTa tokenizer for the wikitext checkpoint. The proposed implementation works in both encoder and decoder settings, and all relevant tests are passing.
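For reviewers who want a quick smoke test, here is a minimal usage sketch. The checkpoint path and class name below are assumptions following the library's usual naming conventions; the final names may differ once this is merged:

```python
from transformers import AutoTokenizer, MegaForMaskedLM  # class name assumed from the usual ModelForTask convention

# hypothetical hub path for the wikitext-103 MLM checkpoint described above
checkpoint = "mnaylor/mega-base-wikitext"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # resolves to the RoBERTa tokenizer used for pretraining
model = MegaForMaskedLM.from_pretrained(checkpoint)

inputs = tokenizer("The Eiffel Tower is located in <mask>.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, vocab_size)
```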
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [X] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@ArthurZucker and @younesbelkada for text models; tagging @NielsRogge for visibility as he responded to the original issue.
Sorry for the initial test failures! They should be taken care of now. Also, I want to point out that I did not have access to a GPU while developing this, so I was not able to test on one.
Hi @mnaylor5, thanks for your great work on this! Let us know when you think this is ready for review 💪
Thank you @younesbelkada! It is ready for review now 😄
Hi @younesbelkada / @ArthurZucker - just checking in to see if there is anything you need from me before reviewing this pull request. Looking forward to being able to use Mega in transformers!
Hey! I'll give you a review tomorrow! Sorry for the wait, I had to sync with @younesbelkada on this one.
Thanks @ArthurZucker, and no worries! 😄
Thanks for the review @ArthurZucker! I'll reply to individual comments where I can clear things up, and I'll accept your suggestions wherever I can. I'll probably be able to start on the modifications later today, and if not, then early next week.
Alright @ArthurZucker, this should be good to review again! The biggest updates in this version are removing the `reset_parameters` methods in favor of `_init_weights`, renaming variables and comments to avoid single-letter names, updating the docstring format, and renaming Mega to MEGA based on your suggestion. I have resolved the comments where I made the changes and left the other comments open for continued discussion.
Thanks again for your feedback, and I'm happy to answer any questions that arise. Looking forward to getting MEGA into the library! 🚀
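As a quick note on the `_init_weights` change for anyone following along: transformers centralizes weight initialization on the pretrained base class rather than in per-module `reset_parameters` methods. A rough sketch of the pattern (module branches abbreviated, not the exact Mega code):

```python
import torch.nn as nn
from transformers import PreTrainedModel


class MegaPreTrainedModel(PreTrainedModel):
    """Sketch of the library-wide pattern: all weight init lives in one method."""

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
```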
Hi there @ArthurZucker - thanks again for the feedback in your previous review. Just reaching out to see if anything else is needed from me before the next review and, hopefully, a merge!
Hey! Sorry I must have missed your previous ping! Will review now!
Thanks @ArthurZucker! I appreciate the quick review and the encouragement 😄 I added a couple of questions where things weren't totally clear to me, but I can get started on everything else now. I'm really excited about getting this model into the library, and hopefully there won't be too many more changes required!
Will answer your questions tomorrow!
Alright @ArthurZucker, I think that's everything except the threads with ongoing discussion. I'm super happy with how this is shaping up! In the latest batch of commits:
- Renamed classes, variables, and params based on comments (mainly in EMA and MovingAverageGatedAttention class)
- Rearranged positional bias, normalization functions, activation functions, dropout classes
- Added the `copied from` comments where requested
- Added token type ID buffer
- Added tests for generation and sequence classification
- Moved FFT convolution into a reusable method with additional documentation (see the sketch after this list)
- Addressed merge conflicts from LLaMA 🦙
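For context on the FFT refactor mentioned above: Mega's EMA sublayer amounts to a long convolution over the sequence, which is much cheaper to compute in the frequency domain. A minimal sketch of the general technique (the helper name and padding choice are illustrative, not the exact method in this PR):

```python
import torch


def fft_convolution(inputs, kernel, length):
    """Convolve along the last (sequence) dimension via the FFT.

    Zero-padding to 2 * length makes the circular FFT convolution behave like
    a linear convolution; the first `length` outputs are the causal part.
    """
    inputs_fft = torch.fft.rfft(inputs.float(), n=2 * length)
    kernel_fft = torch.fft.rfft(kernel.float(), n=2 * length)
    convolved = torch.fft.irfft(inputs_fft * kernel_fft, n=2 * length)
    return convolved[..., :length]
```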
Thanks for the feedback and I'll wait on any more changes until you get a chance to review the updates and resolve the open discussions. Excited to get up and running with MEGA in transformers 🚀 🤗
@ArthurZucker as an update, it looks like the fix for left-padding is going to be a more significant effort to implement -- the relative bias is applied in the attention function, and it expects all of the inputs to be left-to-right starting at position 0. We can probably refactor to accept the position IDs like they did for CodeGen, but we'll also need to change how the bias is added since it is currently using a single (seq_len, seq_len) tensor for the entire batch. Refactoring that might be the heavier lift, but I'm still exploring.
I'll dig more into this tomorrow, but for the meantime, I've pushed updates that address the rest of your comments! If you have any other suggestions on the fix for relative positions, I'd love to hear them! 😄
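To make the kind of refactor I'm describing concrete, here's a rough sketch of a per-row gather over a precomputed bias table (hypothetical helper, not code in this PR):

```python
import torch


def gather_relative_bias(bias, position_ids):
    """Sketch: align a (max_len, max_len) bias table with left-padded rows.

    bias: table where bias[i, j] is the positional bias between positions i and j
    position_ids: (batch, seq_len) true positions, e.g. starting after the pad tokens
    returns: (batch, seq_len, seq_len) bias gathered per batch row
    """
    rows = bias[position_ids]  # (batch, seq_len, max_len): rows for each query position
    cols = position_ids.unsqueeze(1).expand(-1, position_ids.size(1), -1)
    # result[b, i, j] == bias[position_ids[b, i], position_ids[b, j]]
    return torch.gather(rows, 2, cols)
```

With right padding, every row's positions are simply 0..seq_len-1, so the single shared (seq_len, seq_len) tensor works; the gather only becomes necessary once rows can start at different offsets.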
Sure! Also it's not that important to have left padding in this PR, can be added in another PR!
Thanks @ArthurZucker! After digging into it, I do think it will require a pretty significant refactor to support left-padding in this PR. If you're comfortable with it, I agree that it could make sense in a new PR. I just added an entry in the `MegaBlock` docstring for the new `causal_mask` coming from the pretrained model's method, and added a missing device for the token type IDs.
Also pulled the latest changes from main to hopefully fix whatever was causing the tests for exotic models to fail. I'm really happy with how this is looking, so let me know if there's anything else needed to move forward with this PR! Appreciate your comments and guidance on everything so far! 🚀
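For reference, the token type ID device fix follows the usual buffered-fallback pattern in the library; sketched from memory here with illustrative names, not the exact diff:

```python
import torch


def default_token_type_ids(embeddings, input_ids):
    """Sketch: build token type IDs on the same device as the inputs.

    Prefers a registered buffer (already on the model's device) and otherwise
    creates zeros explicitly on input_ids.device, which is the missing piece
    the fix above adds.
    """
    batch_size, sequence_length = input_ids.shape
    if hasattr(embeddings, "token_type_ids"):
        buffered = embeddings.token_type_ids[:, :sequence_length]
        return buffered.expand(batch_size, sequence_length)
    return torch.zeros(
        (batch_size, sequence_length), dtype=torch.long, device=input_ids.device
    )
```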
Awesome, it's alright with me to leave this to another PR. Will do my final review before pinging @sgugger for another pair of eyes!
Thanks again @ArthurZucker and @sgugger! Appreciate the feedback, and it should all be addressed in the latest changes 🤗
Great working with you @mnaylor5! Congrats again on the merge 🔥
Congrats @mnaylor5! Feel free to share on social media and we'll amplify your post
Thanks so much @ArthurZucker and @NielsRogge! I learned a ton through this process, and it's so rewarding to see my code in a library I use so much ❤️
I posted something here on LinkedIn a couple days ago - I'll tag you guys in the comments as well! https://www.linkedin.com/posts/mitchnaylor_mega-activity-7045103140890660864-9VOU