Is Infini-attention support possible?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Infini-attention, as described in the paper: https://arxiv.org/pdf/2404.07143. There is a Python implementation here: https://github.com/mustafaaljadery/gemma-2B-10M
Motivation
This new attention mechanism allows effectively unlimited context without the quadratic memory and compute penalty of standard attention, by combining local attention with a fixed-size compressive memory. There is a proof of concept running a 10M-token context in under 32 GB of RAM. I feel this would be extremely useful to support, but I'm uncertain what changes, if any, would be required in llama.cpp.
Possible Implementation
The Python implementation linked above could serve as a reference: https://github.com/mustafaaljadery/gemma-2B-10M. A rough sketch of the core mechanism follows below.
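For context, here is a minimal NumPy sketch of how I read the paper's core recurrence: per segment, retrieve from the compressive memory, run ordinary causal attention locally, gate the two together, then apply the linear memory update. This is my own interpretation of the paper's equations, not code from llama.cpp or the linked repo; the single-head shapes, the scalar gate `beta`, and the epsilon initialization of `z` are simplifications I chose for illustration.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity the paper uses for the memory path
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def infini_attention_segment(Q, K, V, M, z, beta=0.0):
    """One segment step for a single head.

    Q, K, V: (N, d) projections for the current segment of N tokens
    M: (d, d) compressive memory carried across segments
    z: (d,) normalization term carried across segments
    beta: learned gate (a scalar here for simplicity)
    """
    d = Q.shape[-1]
    sq, sk = elu_plus_one(Q), elu_plus_one(K)

    # Memory retrieval: A_mem = sigma(Q) M / (sigma(Q) z), using the state
    # from previous segments
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Standard causal dot-product attention within the segment
    scores = (Q @ K.T) / np.sqrt(d)
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_dot = weights @ V

    # Gated mix of long-term (memory) and local attention
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_dot

    # Linear memory update: M += sigma(K)^T V, z += sum over tokens of sigma(K)
    M = M + sk.T @ V
    z = z + sk.sum(axis=0)
    return A, M, z

# Stream segments; the carried state stays O(d^2) regardless of context length
d, N = 64, 128
rng = np.random.default_rng(0)
M = np.zeros((d, d))
z = np.full(d, 1e-6)  # small init avoids 0/0 retrieval on the first segment
for _ in range(4):
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    A, M, z = infini_attention_segment(Q, K, V, M, z)
```

The key point is that the only state carried between segments is a fixed-size (d, d) matrix plus a (d,) vector per head, so memory use is constant as the context grows, which is what makes the 10M-token proof of concept feasible.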
Thanks so much for looking!
The paper describes a new model architecture that would have to be implemented, which takes some work.
The model they released with the paper is a "very early checkpoint", so it might be wise to wait until at least one fully baked model exists for this architecture.
It's a very cool model though, so it might be worth it.
This issue was closed because it has been inactive for 14 days since being marked as stale.