
Added dynamic context size. This is useful for servers running llama models as a service.

Open J4e6eR opened this issue 9 months ago • 4 comments

The context size, which is used to allocate the space for model execution and the KV cache, cannot be modified once the model and context params are initialized. This is a problem for servers running models, since context sizes are bound to grow over time. With dynamic context size, there is no need to restart the server once the context size is exceeded.

Dynamic context size is achieved by modifying n_ctx in cparams and then resetting the previous memory to create new memory via memory.reset(model.create_memory(params_mem, cparams)). Since creating new memory discards the earlier context, the best way to preserve it is to save the state beforehand and load it back afterwards.

I will add the load-state feature as a default when performing this operation in the next commit.

J4e6eR avatar May 04 '25 08:05 J4e6eR

Next goal is to get a dynamic context size working without the need for resetting memory. Is it possible? Let's see!!

J4e6eR avatar May 05 '25 10:05 J4e6eR

Hey @ggerganov, please have a look at this. It can be helpful for servers that need a dynamic context size, preventing them from terminating with errors when the program exceeds the context size. I am currently working on the follow-up task I posted earlier. Are there any changes you'd like me to make to improve this commit? I am open to suggestions and improvements. Thank you.

J4e6eR avatar May 07 '25 08:05 J4e6eR

Hi, I am not convinced that this is a useful feature. IMO the application should pre-allocate the worst-case amount of memory that it plans to use. This way, if it is able to start, you have a guarantee that it will keep running without running out of memory at some later point.

I don't see use cases where dynamically adjusting the context has an advantage compared to the existing logic.

ggerganov avatar May 07 '25 09:05 ggerganov

@ggerganov If the application allocates more memory beforehand, what is the significance of the context size (n_ctx)? Earlier, when I was testing one of the example programs (probably simple-chat), I did exceed the context size after a few back-and-forth exchanges with the model, and the program terminated with the error message "context size exceeded".

J4e6eR avatar May 07 '25 10:05 J4e6eR

Finally achieved dynamic modification of the context size without resetting the memory.

J4e6eR avatar Jul 30 '25 11:07 J4e6eR