Is Infini-attention support possible?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Infini-attention, as described in the paper: https://arxiv.org/pdf/2404.07143. There is a Python implementation here: https://github.com/mustafaaljadery/gemma-2B-10M
Motivation
This new attention mechanism allows effectively unlimited context without the quadratic memory and compute penalty of standard attention, by combining local attention with a fixed-size compressive memory. There is a proof of concept running a 10M-token context in under 32 GB of RAM. I feel this would be extremely useful to support, but I'm uncertain what changes, if any, would be required in llama.cpp.
Possible Implementation
The Python implementation linked above could serve as a reference: https://github.com/mustafaaljadery/gemma-2B-10M. A rough sketch of the core mechanism follows below.
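For context, here is a minimal NumPy sketch of how I read the paper's core recurrence: per segment, retrieve from the compressive memory, run ordinary causal attention locally, gate the two together, then apply the linear memory update. This is my own interpretation of the paper's equations, not code from llama.cpp or the linked repo; the single-head shapes, the scalar gate `beta`, and the epsilon initialization of `z` are simplifications I chose for illustration.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity the paper uses for the memory path
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def infini_attention_segment(Q, K, V, M, z, beta=0.0):
    """One segment step for a single head.

    Q, K, V: (N, d) projections for the current segment of N tokens
    M: (d, d) compressive memory carried across segments
    z: (d,) normalization term carried across segments
    beta: learned gate (a scalar here for simplicity)
    """
    d = Q.shape[-1]
    sq, sk = elu_plus_one(Q), elu_plus_one(K)

    # Memory retrieval: A_mem = sigma(Q) M / (sigma(Q) z), using the state
    # from previous segments
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Standard causal dot-product attention within the segment
    scores = (Q @ K.T) / np.sqrt(d)
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_dot = weights @ V

    # Gated mix of long-term (memory) and local attention
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_dot

    # Linear memory update: M += sigma(K)^T V, z += sum over tokens of sigma(K)
    M = M + sk.T @ V
    z = z + sk.sum(axis=0)
    return A, M, z

# Stream segments; the carried state stays O(d^2) regardless of context length
d, N = 64, 128
rng = np.random.default_rng(0)
M = np.zeros((d, d))
z = np.full(d, 1e-6)  # small init avoids 0/0 retrieval on the first segment
for _ in range(4):
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    A, M, z = infini_attention_segment(Q, K, V, M, z)
```

The key point is that the only state carried between segments is a fixed-size (d, d) matrix plus a (d,) vector per head, so memory use is constant as the context grows, which is what makes the 10M-token proof of concept feasible.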
Thanks so much for looking!
The paper describes a new model architecture that would have to be implemented, which takes some work.
The model they released with the paper is a "very early checkpoint", so it might be wise to wait until at least one fully baked model exists for this architecture.
It's a very cool model though, so it might be worth it.
This issue was closed because it has been inactive for 14 days since being marked as stale.