
Max input token length

barthfab opened this issue 3 years ago

[in consultation with @mryab]

The max input token length is 2048 right now. It would be nice to process more than 2048 tokens through the distributed BLOOM. Increasing the max input token length would help me a lot in my research.

@mryab @borzunov @justheuristic

barthfab avatar Dec 10 '22 11:12 barthfab

Sorry for taking so long to respond, we were a bit overwhelmed last week.

Can you please clarify: do you need >2048 tokens for both forward/backward and inference, or just inference? If inference only, we can transition to 4096 or more tokens with a single-line code change. If forward/backward, I can look into that and figure out how hard it's going to be.

justheuristic avatar Dec 21 '22 18:12 justheuristic
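
For reference, the limit under discussion is the per-session token budget that the client requests from servers at inference time. Below is a minimal sketch of that path, assuming the Petals client API as shown in the project's public examples; class and argument names have changed across releases, so treat it as illustrative rather than exact.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "bigscience/bloom"  # assumes this model is served on the public swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "A long in-context-learning prompt..."
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]

# `max_length` is the total budget (prompt + generated tokens) that servers
# allocate attention caches for; this is the value that was capped at 2048.
# In recent clients, generate() reuses the session opened by the `with` block.
with model.inference_session(max_length=4096):
    outputs = model.generate(inputs, max_new_tokens=64)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```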

Just inference is fine for in-context learning! Thanks a lot

barthfab avatar Dec 22 '22 11:12 barthfab

#include stdsorryfortakingdaystorespond.h

We will increase it in the next major release (ETA ~~Jan 1st-3rd~~) and post an update to this issue.

Update: this will take a bit longer; we need to get a few more things done in that release. We'll keep you updated in this issue.

justheuristic avatar Dec 26 '22 19:12 justheuristic

I believe that I'm running into the same problem with the chat app. After a certain length, every conversation ends with the session crashing.

It doesn't appear that I can truncate conversations to "the most recent X number of characters/tokens," because history is saved within the open session (if I'm understanding correctly), and that's a Petals thing. There's nowhere in the chat app for me to fix this.

I'd be perfectly fine with chopping off the beginning of the conversation history to keep the total length under some maximum. I know it isn't ideal, but the user experience of every conversation ending with a crash is pretty bad, too.

Just posting my thoughts here, for posterity.

Vectorrent avatar Feb 19 '23 18:02 Vectorrent
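
A client-side version of the workaround described above could look roughly like the sketch below. `truncate_history` is a hypothetical helper for illustration, not part of Petals or the chat app; it simply keeps the most recent tokens of the conversation before a fresh session is started.

```python
MAX_SESSION_TOKENS = 2048   # server-side session limit at the time of this thread
REPLY_BUDGET = 256          # tokens reserved for the model's next reply

def truncate_history(history_text, tokenizer,
                     max_tokens=MAX_SESSION_TOKENS, reply_budget=REPLY_BUDGET):
    """Drop the oldest tokens so the prompt plus reply fits under the session limit."""
    ids = tokenizer(history_text)["input_ids"]
    budget = max_tokens - reply_budget
    if len(ids) > budget:
        ids = ids[-budget:]  # chop off the beginning of the conversation history
    return tokenizer.decode(ids)
```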

Hi @barthfab @LuciferianInk,

We extended the context length to 8192 for the latest models that use multi-query attention (Llama 2, StableBeluga 2, CodeLlama, etc.). Feel free to reopen this if it is not enough and the issue is still relevant for you.

borzunov avatar Aug 30 '23 04:08 borzunov
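
For reference, a rough sketch of a chat loop under the new 8192-token limit, assuming a Llama-2-family model is being served on the swarm; the repo name and exact client calls follow the public Petals examples and are assumptions, not something stated in this thread.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "stabilityai/StableBeluga2"  # example repo; availability on the swarm is assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

# One session keeps the conversation's attention caches on the servers, so each
# turn only sends the new tokens; the accumulated history must stay under max_length.
with model.inference_session(max_length=8192):
    for user_turn in ["Hello!", "Summarize our chat so far."]:
        inputs = tokenizer(user_turn, return_tensors="pt")["input_ids"]
        outputs = model.generate(inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```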

> Hi @barthfab @LuciferianInk,
>
> We extended the context length to 8192 for the latest models that use multi-query attention (Llama 2, StableBeluga 2, CodeLlama, etc.). Feel free to reopen this if it is not enough and the issue is still relevant for you.

I think this is about to become an issue for me, as I'm about to experiment with deploying some models with a 128k token context length to Petals. If the inference length is quick to modify, that's a start.

TomExMachina avatar Sep 04 '23 20:09 TomExMachina

No commits or PRs were linked here, or I might attempt a new PR myself.

TomExMachina avatar Sep 04 '23 20:09 TomExMachina