Support LoRA hotswapping and multiple LoRAs at a time
This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in https://github.com/ggerganov/llama.cpp/pull/8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)
The changes from upstream in https://github.com/ggerganov/llama.cpp/pull/8332 are:
- Refactor lora API
- Allow hot-swapping lora
- Added struct llama_lora_adapter to keep track of loaded lora
I have made some llama-cpp-python changes to enable this support:
- Updated C wrappers
- Added `_internals.LlamaLoraAdapter` to wrap llama.cpp's `llama_lora_adapter`
- Modified the wrapper lifecycle to free `llama_lora_adapter` correctly
- Added a high-level API in the Llama wrapper - it now supports a dict of LoRA adapters to reflect llama.cpp's support for multiple LoRAs, and has a method for changing LoRA scales
- Updated the cache to include LoRA adapter weights in cache keys, because different active LoRAs will have different cache state (a conceptual sketch of this follows below the list)
- Updated server to support hot-swapping LoRAs when a base model is shared
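To illustrate the cache change mentioned above, here is a conceptual sketch - not the PR's actual cache code - of why the active LoRA state has to be part of the prompt-cache key: the same tokens evaluated under different adapter scales produce different KV-cache contents, so they must not share a cache entry.

# Conceptual sketch only - not the PR's actual cache code. Illustrates why the
# active LoRA state must contribute to the prompt-cache key: the same tokens
# evaluated with different adapter scales yield different KV-cache contents.
from typing import Dict, Sequence, Tuple

def cache_key(tokens: Sequence[int], lora_adapters: Dict[str, float]) -> Tuple:
    # Sort adapter (path, scale) pairs so the key is independent of insertion order.
    adapter_state = tuple(sorted(lora_adapters.items()))
    return (tuple(tokens), adapter_state)

# Same prompt, different adapter scales -> different cache entries.
assert cache_key([1, 2, 3], {"a.gguf": 1.0}) != cache_key([1, 2, 3], {"a.gguf": 0.0})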
I have an example of usage through the API and via the server here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147#file-lora-md
Example API usage:
>>> import llama_cpp
>>> llm = llama_cpp.Llama("<model>") # Can also add LoRAs in dict here
>>> llm.lora_adapters
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 0.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_content_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
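As hinted by the comment in the constructor call above, adapters can also be supplied when constructing the model. Here is a rough sketch of that usage; I'm assuming the constructor parameter is named lora_adapters and takes the same path-to-scale dict exposed by llm.lora_adapters, and the adapter paths are just placeholders:

# Sketch only: assumes a `lora_adapters` constructor parameter that mirrors the
# path-to-scale dict exposed by `llm.lora_adapters`; adapter paths are placeholders.
import llama_cpp

llm = llama_cpp.Llama(
    "<model>",
    lora_adapters={
        "./adapters/lora_tldr_headline_gen.gguf": 1.0,  # active from the start
        "./adapters/lora_tldr_content_gen.gguf": 0.0,   # loaded but initially disabled
    },
)
print(llm.lora_adapters)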
Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model_alias": "mistral",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "verbose": true
    },
    {
      "model_alias": "mistral-magicoder",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    },
    {
      "model_alias": "mistral-conllpp",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    }
  ]
}
Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.
Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size = 13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
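For reference, here is a rough client-side sketch of driving the hot-swap from Python. It assumes the server from the config above is listening on port 8080 and that you call the usual OpenAI-compatible /v1/completions route:

# Sketch: call the OpenAI-compatible server with different model aliases from the
# config above; switching aliases triggers a LoRA hot-swap on the shared base model.
# Assumes the server is on localhost:8080 and exposes /v1/completions.
import requests

def complete(model_alias: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"model": model_alias, "prompt": prompt, "max_tokens": 32, "temperature": 0},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

for alias in ["mistral", "mistral-magicoder", "mistral-conllpp"]:
    print(alias, "->", complete(alias, "def fibonacci(n):"))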
This seems to be a cool feature to have. Any idea when this will be available?
The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.
Thanks Rich. Let me know when I can try it out.
The code is ready for review now, thanks for your patience!
@hrsmanian , if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147
Thanks Rich. Seems to work as expected overall. However, when I run it on thousands of conversations, I see a constant increase in GPU memory. Could there be a memory leak? Or should we completely disable LoRA adapters?
That sounds like a memory leak for sure. Can you give me a bit more information about how you're running it? Are you loading (and unloading) LoRA adapters? Are you using the http server or the API? Also any info on how you're measuring memory usage, how fast it is growing, etc might be useful. Thanks!
@richdougherty , I am running offline inference in a loop, loading one conversation at a time. In another window, I run nvidia-smi to watch the GPU memory usage. It increases by roughly 4 MB for each inference. I load the model and set the LoRA as below.
import llama_cpp
import json
import time

llm = llama_cpp.Llama("gguf_models/Llama-3.2-3B-Instruct-f16.gguf", n_gpu_layers=-1, verbose=False, n_ctx=6000)

# loading both the adapters
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 0.0)
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)

infile = "val_English.jsonl"
fp = open(infile, "r")
time1 = 0.0
time2 = 0.0
count = 0
for line in fp:
    json_data = json.loads(line.strip())
    dialogue = json_data['dialogue']

    model_prompt = f"Summarize the text"

    s1 = time.time()

    ############## activate 1st adapter and disable 2nd adapter
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 1.0)
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)
    print(f"LoRA State1: {llm.lora_adapters}")

    prompt_1 = f"""<|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for conversation summarization<|eot_id|><|start_header_id|>user<|end_header_id|>
{model_prompt}: {dialogue}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
    output = llm.create_completion(seed=12345, temperature=0.01, top_p=0.99, top_k=250, max_tokens=256, prompt=prompt_1, stop=['<|eot_id|>'])
    time1 += time.time() - s1
    print_out1 = f"""MODEL OUTPUT:\n\n {output['choices'][0]['text']}
Usage:
Input Tokens: {output['usage']['prompt_tokens']}
Output Tokens: {output['usage']['completion_tokens']}
Total Tokens: {output['usage']['total_tokens']}
"""
    print(print_out1)
Hi, is any more information needed on this? Kindly let me know.
Hi @hrsmanian, apologies for the time to get back. Based on your explanation - seeing GPU memory usage increasing - it sounds like a leak in the llama.cpp allocated GPU memory. This could indicate either a bug in llama.cpp's LoRA adapter code or - more likely! - a bug in the bindings that I wrote.
A bug in the llama-cpp-python bindings in this PR would be something like using the llama.cpp API incorrectly, causing extra GPU allocations for the LoRA adapters or failing to deallocate them somewhere.
Unfortunately, I only have a CPU for inference, but I believe I should still be able to spot an incorrect usage of the llama.cpp LoRA API by watching RAM usage for the Python process. With CPU inference, llama.cpp stores the models and adapters in process RAM, so any mistakes in allocating or deallocating LoRA adapters should show up directly as an increase in the Python process's virtual memory usage.
Note: a Python object memory leak in the wrapper objects might not show up, since the garbage-collected Python heap might not visibly grow with each leaked object. But looking at Python process RAM usage should be good enough to reveal the kind of leak you're describing in llama.cpp-allocated memory: llama.cpp does not allocate from the Python heap, so its allocations are visible directly in the process memory usage.
Test script
I've written a little script to test this. I'm using psutil to check the Python process RAM usage after each operation. This works for me for testing a leak when using CPU for inference.
I do not see a leak using this to test. Certainly nothing on the order of 4MB for each inference. Would you mind checking the script on your machine as well? I can also test your script if you want to link to your models / adapters and dataset you use to test, but I understand that these might be private.
You can try it using psutil to view RAM usage. To help with the GPU memory leak debugging perhaps you could edit the script so that it reports GPU usage. I saw there are a couple of libraries that can help with this (not tested by me). These are https://github.com/pmav99/nvsmi or https://pypi.org/project/nvidia-ml-py/ . You could patch the log_mem function to report GPU memory usage. That could give a useful log showing any leaks in GPU memory.
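For example, here is an untested sketch (I only have a CPU) of such a patch using nvidia-ml-py's pynvml module; gpu_mem_used is a helper name I made up, and it assumes a single GPU at index 0:

# Untested sketch (I have no NVIDIA GPU): extend log_mem below to also report
# GPU memory via nvidia-ml-py (the pynvml module).
import pynvml

pynvml.nvmlInit()
_gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes a single GPU at index 0

def gpu_mem_used() -> int:
    """Return the GPU's used memory in bytes."""
    return pynvml.nvmlDeviceGetMemoryInfo(_gpu_handle).used

# Inside log_mem, something like:
#     print(f'GPU Memory Used: {gpu_mem_used():,}')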
For my tests I am using the model and adapters described in the guide I wrote before: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147 . This test only tests very short prompts and completions. I tried a slightly larger prompt and completion further down in this comment, but perhaps you can test using your dataset as well.
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
export ADAPTER1_GGUF=./adapters/lora_tldr_headline_gen.gguf
export ADAPTER2_GGUF=./adapters/lora_tldr_content_gen.gguf
pip install psutil
python memtest.py 2>&1
memtest.py
import os
model_gguf = os.environ['MODEL_GGUF']
adapter1_gguf = os.environ['ADAPTER1_GGUF']
adapter2_gguf = os.environ['ADAPTER2_GGUF']
import psutil
process = psutil.Process()
prev_vms = 0
prev_rss = 0
def log_mem(msg):
    global prev_rss, prev_vms
    pmem = process.memory_info()
    vms = pmem.vms
    rss = pmem.rss
    delta_vms = vms - prev_vms
    delta_rss = rss - prev_rss
    print(f'====== {msg:<40} {vms:>16,} ({delta_vms:>+16,}) {rss:>16,} ({delta_rss:>+16,}) ======')
    prev_vms = vms
    prev_rss = rss
log_mem('initial')
import llama_cpp
log_mem('imported llama_cpp')
llm = llama_cpp.Llama(model_gguf)
log_mem('loaded model')
i = 0
for i in range(0, 100):
    # Create a pattern of enablement so we can see all patterns of enabled/disabled
    # as well as having sequences where no changes happen.
    desired_adapter1_scale = i // 2 % 2 * 1.0  # Enable 2 out of every 4 times
    desired_adapter2_scale = i // 4 % 2 * 1.0  # Enable 4 out of every 8 times
    # Check current state - note that we treat the initial state when they are not
    # loaded as 0.0 to ensure we have a couple of tests without them loaded
    lora_adapters = llm.lora_adapters or {}
    current_adapter1_scale = lora_adapters.get(adapter1_gguf, 0.0)
    current_adapter2_scale = lora_adapters.get(adapter2_gguf, 0.0)
    if current_adapter1_scale != desired_adapter1_scale:
        llm.set_lora_adapter_scale(adapter1_gguf, desired_adapter1_scale)
        log_mem(f'after set adapter 1 scale {desired_adapter1_scale}')
    if current_adapter2_scale != desired_adapter2_scale:
        llm.set_lora_adapter_scale(adapter2_gguf, desired_adapter2_scale)
        log_mem(f'after set adapter 2 scale {desired_adapter2_scale}')
    llm.create_completion(seed=12345, temperature=0, max_tokens=16, prompt=str(i))
    log_mem(f'after completion "{i}"')
When I run this I see initial allocations in virtual memory (first column), but it stays stable after the adapters have been loaded. The RAM usage stays the same across the various loads and unloads.
python memtest.py 2>/dev/null
====== initial 36,880,384 ( +36,880,384) 19,529,728 ( +19,529,728) ======
====== imported llama_cpp 314,400,768 ( +277,520,384) 42,971,136 ( +23,441,408) ======
====== loaded model 4,740,259,840 ( +4,425,859,072) 4,262,637,568 ( +4,219,666,432) ======
====== after completion "0" 4,774,264,832 ( +34,004,992) 4,263,817,216 ( +1,179,648) ======
====== after completion "1" 4,774,264,832 ( +0) 4,263,817,216 ( +0) ======
====== after set adapter 1 scale 1.0 4,789,526,528 ( +15,261,696) 4,279,021,568 ( +15,204,352) ======
====== after completion "2" 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after completion "3" 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after set adapter 1 scale 0.0 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,383,296 ( +13,856,768) 4,292,915,200 ( +13,893,632) ======
====== after completion "4" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "5" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "6" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "7" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "8" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
...
Small RSS changes
Note that I am seeing a small (128 KB) occasional increase in resident memory usage (last column), which could be a different kind of leak, for example Python VM operations such as GC not reclaiming everything straight away. I don't think this is your memory leak, though, because I would expect llama.cpp-allocated memory to show up as an increase in virtual memory (first column), not just resident memory. Nonetheless, it's worth keeping an eye on; there's a small tracemalloc sketch after the excerpt below for checking whether the Python heap itself is growing.
...
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "10" 4,803,383,296 ( +0) 4,293,046,272 ( +131,072) ======
====== after completion "11" 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
...
====== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
====== after completion "72" 4,803,383,296 ( +0) 4,293,177,344 ( +131,072) ======
====== after completion "73" 4,803,383,296 ( +0) 4,293,177,344 ( +0) ======
...
====== after set adapter 2 scale 1.0 4,803,383,296 ( +0) 4,293,177,344 ( +0) ======
====== after completion "100" 4,803,383,296 ( +0) 4,293,308,416 ( +131,072) ======
====== after completion "101" 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
...
====== after completion "253" 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
...
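To help check whether the Python heap itself is growing (as opposed to native llama.cpp allocations), here is a small sketch using the standard library's tracemalloc. log_python_heap is a hypothetical helper meant to be called next to each log_mem call in memtest.py; tracemalloc only sees allocations made through Python's allocator, so growth here would point at the bindings' Python objects rather than llama.cpp:

# Sketch: distinguish Python-heap growth from native (llama.cpp) growth.
# tracemalloc only tracks allocations made through Python's allocator, so a
# leak here points at the bindings' Python objects rather than llama.cpp itself.
import tracemalloc

tracemalloc.start()
prev_current = 0

def log_python_heap(msg: str) -> None:
    global prev_current
    current, peak = tracemalloc.get_traced_memory()
    print(f'--- {msg:<40} python heap {current:>12,} ({current - prev_current:>+10,}) peak {peak:>12,}')
    prev_current = current

# Call log_python_heap(...) next to each log_mem(...) call in memtest.py.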
Testing larger prompt and completion size
The previous test only used short numeric prompts with a very small max token count. A slightly larger test might show a leak.
I patched the create_completion call to generate something a bit larger. This used more memory but didn't seem to leak either.
llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=str(i) + ' the quick brown fox jumped over the lazy dog who knows what will come next with a longer prompt')
====== initial 36,884,480 ( +36,884,480) 19,660,800 ( +19,660,800) ======
====== imported llama_cpp 314,400,768 ( +277,516,288) 42,971,136 ( +23,310,336) ======
====== loaded model 4,740,255,744 ( +4,425,854,976) 4,262,821,888 ( +4,219,850,752) ======
====== after completion "0" 4,774,723,584 ( +34,467,840) 4,267,278,336 ( +4,456,448) ======
====== after completion "1" 4,774,723,584 ( +0) 4,267,278,336 ( +0) ======
====== after set adapter 1 scale 1.0 4,789,907,456 ( +15,183,872) 4,282,482,688 ( +15,204,352) ======
====== after completion "2" 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after completion "3" 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after set adapter 1 scale 0.0 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,760,128 ( +13,852,672) 4,296,376,320 ( +13,893,632) ======
====== after completion "4" 4,803,760,128 ( +0) 4,296,507,392 ( +131,072) ======
====== after completion "5" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "6" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "7" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "8" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "9" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "10" 4,803,760,128 ( +0) 4,296,769,536 ( +262,144) ======
====== after completion "11" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "12" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "13" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "14" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "15" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
Thanks Rich. I am still seeing a memory leak on the GPU. Will try a previous build without your changes and keep you posted.
Thanks for checking. To confirm, you ran the script I posted above?
If you are still seeing the leak, my theory is that there's a leak in the llama.cpp CUDA backend, which would explain why you're seeing it and I'm not with the CPU backend.
Currently I don't think the leak is in the Python bindings, because if it were, we should see it with both backends.
This is just my theory though. I would definitely want more info to confirm it - e.g. test different backends, try to replicate it in llama.cpp directly.
If you're able then, running the script above would be good. If you don't have a chance then I should be able to use a cloud server with a GPU to test. (I am investigating how to do that.)
Thanks a lot for your interest and for testing!
Have a decent repro now. I added the nvidia-smi output into your script. Also, the model being used is the model I trained. Below is the output snapshot when no adapter is used. GPU memory remains constant. All good.
====== after completion "1" 49,240,797,184 ( +532,480) 1,255,649,280 ( +659,456) ======
GPU Memory Used: [6729]
====== after completion "2" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "3" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "4" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "5" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "6" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "7" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "8" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "9" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "10" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "11" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "12" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "13" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "14" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "15"
Now below is the memory log when the adapter is set. GPU memory increases constantly.
====== after completion "1" 49,240,805,376 ( +532,480) 1,255,985,152 ( +598,016) ======
GPU Memory Used: [6729]
====== after completion "2" 49,362,423,808 ( +121,618,432) 1,313,374,208 ( +57,389,056) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,423,808 ( +0) 1,313,374,208 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,449,160,704 ( +86,736,896) 1,332,838,400 ( +19,464,192) ======
GPU Memory Used: [6811]
====== after completion "5" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6811]
====== after completion "6" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "7" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "8" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "9" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "10" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "11" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "12" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "13" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "14" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6825]
====== after completion "15" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
And if I set the adapter only once outside the loop, then there is no increase in GPU memory.
====== after completion "0" 49,362,345,984 ( +34,562,404,352) 1,307,557,888 ( +186,347,520) ======
GPU Memory Used: [6773]
====== after completion "1" 49,362,878,464 ( +532,480) 1,308,160,000 ( +602,112) ======
GPU Memory Used: [6773]
====== after completion "2" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "5" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "6" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "7" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "8" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "9" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "10" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "11" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "12" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "13" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "14" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "15" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
Interesting behavior. If I just set one adapter outside the loop, increase max_tokens to 256 and start inference, I see memory increase by about 8 MB per inference.
====== after completion "0" 49,363,697,664 ( +34,563,756,032) 1,309,958,144 ( +188,268,544) ======
GPU Memory Used: [6779]
====== after completion "1" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6787]
====== after completion "2" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6795]
====== after completion "3" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6803]
====== after completion "4" 49,363,832,832 ( +135,168) 1,310,285,824 ( +327,680) ======
GPU Memory Used: [6811]
====== after completion "5" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6819]
====== after completion "6" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6827]
====== after completion "7" 49,363,968,000 ( +135,168) 1,310,285,824 ( +0) ======
GPU Memory Used: [6835]
====== after completion "8" 49,363,968,000 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6843]
====== after completion "9" 49,397,522,432 ( +33,554,432) 1,310,474,240 ( +188,416) ======
GPU Memory Used: [6851]
====== after completion "10" 49,397,657,600 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6859]
====== after completion "11" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6867]
====== after completion "12" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6875]
====== after completion "13" 49,431,212,032 ( +33,554,432) 1,310,474,240 ( +0) ======
GPU Memory Used: [6883]
====== after completion "14" 49,431,347,200 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6891]
====== after completion "15" 49,431,347,200 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6899]
Thanks for confirming that. To summarise the info:
- When no adapter is used, GPU memory remains constant.
- When an adapter is set, GPU memory increases constantly. (Note: I assume this means the adapter is set inside the loop using the code I sent?)
- When the adapter is set only once outside the loop, there is no increase in GPU memory.
- When one adapter is set outside the loop and max_tokens is increased to 256, memory increases by about 8 MB per inference.
I may try to write the same loop code using the llama.cpp C++ library directly, to isolate any issues from the Python bindings in this PR. (You are welcome to have a go at writing the C++ if you wish, otherwise I will get to it this week.) I suspect an issue in the llama.cpp C++ layer, given the way it varies between backends, but we will need a good repro to isolate it and get help from the llama.cpp devs.
I will try to reproduce on GPU and maybe another backend like Vulkan, since CPU is not showing anything for me.
Another thing that might clarify when the memory is leaked would be to log these messages after any LoRA set-adapter calls. That will show memory allocated by the LoRA load operation (if any).
===== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
Also perhaps we should log or vary the max_tokens since that seems relevant?
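For example, a hypothetical tweak to the memtest.py loop (reusing the llm and log_mem definitions from that script) that cycles max_tokens and records it, so we can see whether the growth tracks completion length:

# Hypothetical tweak to the memtest.py loop: cycle max_tokens and log it so we
# can see whether memory growth correlates with completion length.
max_tokens_cycle = [16, 64, 256]

for i in range(0, 100):
    max_tokens = max_tokens_cycle[i % len(max_tokens_cycle)]
    llm.create_completion(seed=12345, temperature=0, max_tokens=max_tokens, prompt=str(i))
    log_mem(f'after completion "{i}" (max_tokens={max_tokens})')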
All your statements above are true. I can summarize even further
- When no adapter is set, no memory increase
- When the adapter is set inside or outside the loop and max_tokens=16, memory increases, but at a small rate
- When the adapter is set inside or outside the loop and max_tokens=256, memory increases by about 8 MB for each inference
Can you share how to run the llama.cpp command line? I can run it on a GPU I have access to.
Good idea to try the llama.cpp command line.
The compiled llama.cpp for the Python bindings is in the vendor subdirectory.
There is a normal llama.cpp cli but I'm not sure if it supports running multiple completions in a single session.
Perhaps you can try running the server and then calling it multiple times with curl or via the ui?
It's in the examples/server subdirectory.
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
You can load a LoRA with --lora or --lora-scaled. It should be possible to set the seed/max tokens etc to match the test case.
Hi @hrsmanian , here is a Bash script to test against llama.cpp.
First, compile the llama-server binary. This should be in the llama-cpp-python source directory.
cd vendor/llama.cpp/
make llama-server
Then run the below script, llama-server-memtest.sh.
#!/bin/bash

# Function to clean up server process
cleanup() {
    local exit_code=$?
    echo "Cleaning up..."
    if [ ! -z "$SERVER_PID" ]; then
        kill $SERVER_PID 2>/dev/null
        wait $SERVER_PID 2>/dev/null
    fi
    exit $exit_code
}

# Set up trap for script exit
trap cleanup EXIT

# Start llama-server in background
./llama-server \
    --model "$MODEL_GGUF" \
    --lora "$ADAPTER1_GGUF" &

# Save server PID
SERVER_PID=$!

# Wait for server to start up
sleep 5

# Function to log memory usage
log_memory() {
    local msg=$1
    # Get virtual and resident memory in bytes
    local mem=$(ps -o vsz=,rss= -p $SERVER_PID)
    local vsz=$(echo $mem | cut -d' ' -f1)
    local rss=$(echo $mem | cut -d' ' -f2)
    # Convert to bytes (ps shows KB)
    vsz=$((vsz * 1024))
    rss=$((rss * 1024))
    # Calculate deltas
    if [ -z "$PREV_VSZ" ]; then
        PREV_VSZ=$vsz
        PREV_RSS=$rss
    fi
    local delta_vsz=$((vsz - PREV_VSZ))
    local delta_rss=$((rss - PREV_RSS))
    # Format with commas for readability
    printf "====== %-40s %'16d (%+'16d) %'16d (%+'16d) ======\n" \
        "$msg" $vsz $delta_vsz $rss $delta_rss
    PREV_VSZ=$vsz
    PREV_RSS=$rss
}

# Log initial memory state
log_memory "initial"

# Run completions in a loop
for i in {1..100}; do
    curl --silent --request POST \
        --url http://127.0.0.1:8080/completion \
        --header "Content-Type: application/json" \
        --data "{\"seed\":12345,\"max_tokens\":16,\"temperature\":0,\"prompt\": \"$i\"}" \
        > /dev/null
    log_memory "after completion \"$i\""
done
When I run it I get output like:
$ ./llama-server-memtest.sh 2>&1 | tee server-memtest.log
...run for awhile...
^C <interrupt>
$ cat server-memtest.log | grep ===
====== initial 11,023,482,880 ( +0) 8,471,502,848 ( +0) ======
====== after completion "1" 11,023,482,880 ( +0) 8,471,502,848 ( +0) ======
====== after completion "2" 11,090,591,744 ( +67,108,864) 8,471,764,992 ( +262,144) ======
====== after completion "3" 11,157,700,608 ( +67,108,864) 8,471,896,064 ( +131,072) ======
====== after completion "4" 11,224,809,472 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "5" 11,291,918,336 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "6" 11,291,918,336 ( +0) 8,471,896,064 ( +0) ======
====== after completion "7" 11,291,918,336 ( +0) 8,471,896,064 ( +0) ======
====== after completion "8" 11,359,027,200 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "9" 11,359,027,200 ( +0) 8,471,896,064 ( +0) ======
====== after completion "10" 11,359,027,200 ( +0) 8,471,896,064 ( +0) ======
====== after completion "11" 11,359,027,200 ( +0) 8,472,027,136 ( +131,072) ======
====== after completion "12" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "13" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "14" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
...
====== after completion "28" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "29" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "30" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
There is some memory growth, but it stabilises after a while. The server might allocate IO buffers, perhaps it's doing caching, etc. It probably needs more analysis to know whether there is a real leak. I thought I'd share the script so you can look at GPU memory usage with it. For a really pure reproduction we may need to write C++ code that uses the plain llama.cpp API, but testing with the llama-server app first is a good start.
Any progress on this? This would be a really helpful feature.
Can LoRA hotswapping be more effective than reloading the model with a LoRA adapter? In my test, reloading the model is enough for me. 🤔