Adding GPT2 with Multi Query Attention
This PR adds a GPT2 architecture with Multi-Query Attention (MQA). With MQA, the key and value projections are shared across all heads and only the queries are per-head, which makes it possible to run the model with very large batches.
This is the architecture used in BigCode's SantaCoder.
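To illustrate the idea, here is a minimal NumPy sketch of the attention computation (not the actual implementation in this PR): each head gets its own query projection, but all heads attend over a single shared key and value tensor, which is what shrinks the KV memory footprint at large batch sizes. Shapes and names below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    """q: (n_heads, seq, head_dim) -- per-head queries.
    k, v: (seq, head_dim) -- a single K and V shared by all heads."""
    d = q.shape[-1]
    # (n_heads, seq, seq): every head scores against the same shared K
    scores = q @ k.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # every head reads from the same shared V
    return weights @ v

rng = np.random.default_rng(0)
n_heads, seq, head_dim = 4, 6, 8
q = rng.standard_normal((n_heads, seq, head_dim))
k = rng.standard_normal((seq, head_dim))  # one K for all heads
v = rng.standard_normal((seq, head_dim))  # one V for all heads
out = multi_query_attention(q, k, v)
print(out.shape)  # (4, 6, 8)
```

In standard multi-head attention, `k` and `v` would each carry an extra `n_heads` dimension; dropping it is the whole trick, and it is what keeps the KV cache small during generation.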
There are a few things to do before we can merge the PR:
- add performance improvements suggested by @jlamypoirier
- fix tests:
  - there is an issue with `past`
  - there is an issue with loading the tokenizer (I guess a missing vocab file in the repo?)
- fix the generation examples
You can run the tests with:
```bash
RUN_SLOW=1 python -m pytest -s -v ./tests/models/gpt2mqa/
```
cc @bigximik @jlamypoirier @RaymondLi0
For review when ready, I'm tagging @ArthurZucker and @younesbelkada.
Regarding the tests `test_batch_generation` and `test_batch_generation_2heads`: if the tokenizer initialisation class is changed from `GPT2Tokenizer` to `GPT2TokenizerFast`, the test passes up until the generated-tokens assertion. Is this intended behaviour, or should the loading functionality have rerouted from the default class?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favour of #22575