Adding GPT2 with Multi Query Attention
This PR adds a GPT2 architecture with Multi-Query Attention (MQA). With MQA, the key and value projections are shared across all heads and only the queries are per-head, which makes it possible to run the model with very large batches.
This is the architecture used in BigCode's SantaCoder.
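To illustrate the idea, here is a minimal NumPy sketch of the attention computation (not the actual implementation in this PR): each head gets its own query projection, but all heads attend over a single shared key and value tensor, which is what shrinks the KV memory footprint at large batch sizes. Shapes and names below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    """q: (n_heads, seq, head_dim) -- per-head queries.
    k, v: (seq, head_dim) -- a single K and V shared by all heads."""
    d = q.shape[-1]
    # (n_heads, seq, seq): every head scores against the same shared K
    scores = q @ k.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # every head reads from the same shared V
    return weights @ v

rng = np.random.default_rng(0)
n_heads, seq, head_dim = 4, 6, 8
q = rng.standard_normal((n_heads, seq, head_dim))
k = rng.standard_normal((seq, head_dim))  # one K for all heads
v = rng.standard_normal((seq, head_dim))  # one V for all heads
out = multi_query_attention(q, k, v)
print(out.shape)  # (4, 6, 8)
```

In standard multi-head attention, `k` and `v` would each carry an extra `n_heads` dimension; dropping it is the whole trick, and it is what keeps the KV cache small during generation.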
There are a few things to do before we can merge the PR:
- add performance improvements suggested by @jlamypoirier
- fix tests:
  - there is an issue with `past`
  - there is an issue with loading the tokenizer (I guess a missing vocab file in the repo?)
- fix the generation examples
You can run the tests with:
```bash
RUN_SLOW=1 python -m pytest -s -v ./tests/models/gpt2mqa/
```
cc @bigximik @jlamypoirier @RaymondLi0
For review when ready, I'm tagging @ArthurZucker and @younesbelkada.
Regarding the tests `test_batch_generation` and `test_batch_generation_2heads`: if the tokenizer initialisation class is changed from `GPT2Tokenizer` to `GPT2TokenizerFast`, the test passes up until the generated-tokens assertion. Is this intended behaviour, or should the loading functionality have rerouted from the default class?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favour of #22575