Support LoRA hotswapping and multiple LoRAs at a time
This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in https://github.com/ggerganov/llama.cpp/pull/8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)
The changes from upstream in https://github.com/ggerganov/llama.cpp/pull/8332 are:
- Refactor lora API
- Allow hot-swapping lora
- Added struct llama_lora_adapter to keep track of loaded lora
I have made some llama-cpp-python changes to enable this support:
- Updated C wrappers
- Added `_internals.LlamaLoraAdapter` to wrap llama.cpp's `llama_lora_adapter`
- Modified the wrapper lifecycle to free `llama_lora_adapter` correctly
- Added a high-level API in the Llama wrapper - it now supports a dict of LoRA adapters to reflect llama.cpp's support for multiple LoRAs, and has a method for changing LoRA scales
- Updated the cache to include LoRA adapter weights in cache keys, because different active LoRAs will have different cache state (a conceptual sketch of this follows below the list)
- Updated server to support hot-swapping LoRAs when a base model is shared
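To illustrate the cache change mentioned above, here is a conceptual sketch - not the PR's actual cache code - of why the active LoRA state has to be part of the prompt-cache key: the same tokens evaluated under different adapter scales produce different KV-cache contents, so they must not share a cache entry.

# Conceptual sketch only - not the PR's actual cache code. Illustrates why the
# active LoRA state must contribute to the prompt-cache key: the same tokens
# evaluated with different adapter scales yield different KV-cache contents.
from typing import Dict, Sequence, Tuple

def cache_key(tokens: Sequence[int], lora_adapters: Dict[str, float]) -> Tuple:
    # Sort adapter (path, scale) pairs so the key is independent of insertion order.
    adapter_state = tuple(sorted(lora_adapters.items()))
    return (tuple(tokens), adapter_state)

# Same prompt, different adapter scales -> different cache entries.
assert cache_key([1, 2, 3], {"a.gguf": 1.0}) != cache_key([1, 2, 3], {"a.gguf": 0.0})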
I have an example of usage through the API and via the server here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147#file-lora-md
Example API usage:
>>> import llama_cpp
>>> llm = llama_cpp.Llama("<model>") # Can also add LoRAs in dict here
>>> llm.lora_adapters
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 0.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_content_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
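As hinted by the comment in the constructor call above, adapters can also be supplied when constructing the model. Here is a rough sketch of that usage; I'm assuming the constructor parameter is named lora_adapters and takes the same path-to-scale dict exposed by llm.lora_adapters, and the adapter paths are just placeholders:

# Sketch only: assumes a `lora_adapters` constructor parameter that mirrors the
# path-to-scale dict exposed by `llm.lora_adapters`; adapter paths are placeholders.
import llama_cpp

llm = llama_cpp.Llama(
    "<model>",
    lora_adapters={
        "./adapters/lora_tldr_headline_gen.gguf": 1.0,  # active from the start
        "./adapters/lora_tldr_content_gen.gguf": 0.0,   # loaded but initially disabled
    },
)
print(llm.lora_adapters)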
Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model_alias": "mistral",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "verbose": true
    },
    {
      "model_alias": "mistral-magicoder",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    },
    {
      "model_alias": "mistral-conllpp",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    }
  ]
}
Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.
Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size = 13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
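For reference, here is a rough client-side sketch of driving the hot-swap from Python. It assumes the server from the config above is listening on port 8080 and that you call the usual OpenAI-compatible /v1/completions route:

# Sketch: call the OpenAI-compatible server with different model aliases from the
# config above; switching aliases triggers a LoRA hot-swap on the shared base model.
# Assumes the server is on localhost:8080 and exposes /v1/completions.
import requests

def complete(model_alias: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"model": model_alias, "prompt": prompt, "max_tokens": 32, "temperature": 0},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

for alias in ["mistral", "mistral-magicoder", "mistral-conllpp"]:
    print(alias, "->", complete(alias, "def fibonacci(n):"))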
This seems to be a cool feature to have. Any idea when this will be available?
The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.
Thanks Rich. Let me know when I can try it out.
The code is ready for review now, thanks for your patience!
@hrsmanian , if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147
Thanks Rich. Seems to work as expected overall. However, when I run it on thousands of conversations, I see a constant increase in GPU memory. Could there be a memory leak? Or should we completely disable LoRA adapters?
That sounds like a memory leak for sure. Can you give me a bit more information about how you're running it? Are you loading (and unloading) LoRA adapters? Are you using the http server or the API? Also any info on how you're measuring memory usage, how fast it is growing, etc might be useful. Thanks!
@richdougherty , I am running offline inference in a loop, loading one conversation at a time. In another window, I run nvidia-smi to watch the GPU memory usage. It increases by roughly 4 MB for each inference. I load the model and set the LoRA as below.
import llama_cpp
import json
import time

llm = llama_cpp.Llama("gguf_models/Llama-3.2-3B-Instruct-f16.gguf", n_gpu_layers=-1, verbose=False, n_ctx=6000)

# loading both the adapters
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 0.0)
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)

infile = "val_English.jsonl"
fp = open(infile, "r")
time1 = 0.0
time2 = 0.0
count = 0
for line in fp:
    json_data = json.loads(line.strip())
    dialogue = json_data['dialogue']

    model_prompt = f"Summarize the text"

    s1 = time.time()

    ############## activate 1st adapter and disable 2nd adapter
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 1.0)
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)
    print(f"LoRA State1: {llm.lora_adapters}")

    prompt_1 = f"""<|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for conversation summarization<|eot_id|><|start_header_id|>user<|end_header_id|>
{model_prompt}: {dialogue}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
    output = llm.create_completion(seed=12345, temperature=0.01, top_p=0.99, top_k=250, max_tokens=256, prompt=prompt_1, stop=['<|eot_id|>'])
    time1 += time.time() - s1
    print_out1 = f"""MODEL OUTPUT:\n\n {output['choices'][0]['text']}
Usage:
Input Tokens: {output['usage']['prompt_tokens']}
Output Tokens: {output['usage']['completion_tokens']}
Total Tokens: {output['usage']['total_tokens']}
"""
    print(print_out1)
Hi, is any more information needed on this? Kindly let me know.
Hi @hrsmanian, apologies for the time to get back. Based on your explanation - seeing GPU memory usage increasing - it sounds like a leak in the llama.cpp allocated GPU memory. This could indicate either a bug in llama.cpp's LoRA adapter code or - more likely! - a bug in the bindings that I wrote.
A bug in the llama-cpp-python bindings in this PR would be something like using the llama.cpp API incorrectly, causing extra GPU allocations for the LoRA adapters or failing to deallocate them somewhere.
Unfortunately, I only have a CPU for inference, but I believe I should still be able to spot an incorrect usage of the llama.cpp LoRA API by watching RAM usage for the Python process. With CPU inference, llama.cpp stores the models and adapters in process RAM, so any mistakes in allocating or deallocating LoRA adapters should show up directly as an increase in the Python process's virtual memory usage.
Note: a Python object memory leak in the wrapper objects might not show up, since the garbage-collected Python heap might not visibly grow with each leaked object. But looking at Python process RAM usage should be good enough to reveal the kind of leak you're describing in llama.cpp-allocated memory: llama.cpp does not allocate from the Python heap, so its allocations are visible directly in the process memory usage.
Test script
I've written a little script to test this. I'm using psutil to check the Python process RAM usage after each operation. This works for me for testing a leak when using CPU for inference.
I do not see a leak using this to test. Certainly nothing on the order of 4MB for each inference. Would you mind checking the script on your machine as well? I can also test your script if you want to link to your models / adapters and dataset you use to test, but I understand that these might be private.
You can try it using psutil to view RAM usage. To help with the GPU memory leak debugging perhaps you could edit the script so that it reports GPU usage. I saw there are a couple of libraries that can help with this (not tested by me). These are https://github.com/pmav99/nvsmi or https://pypi.org/project/nvidia-ml-py/ . You could patch the log_mem function to report GPU memory usage. That could give a useful log showing any leaks in GPU memory.
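For example, here is an untested sketch (I only have a CPU) of such a patch using nvidia-ml-py's pynvml module; gpu_mem_used is a helper name I made up, and it assumes a single GPU at index 0:

# Untested sketch (I have no NVIDIA GPU): extend log_mem below to also report
# GPU memory via nvidia-ml-py (the pynvml module).
import pynvml

pynvml.nvmlInit()
_gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes a single GPU at index 0

def gpu_mem_used() -> int:
    """Return the GPU's used memory in bytes."""
    return pynvml.nvmlDeviceGetMemoryInfo(_gpu_handle).used

# Inside log_mem, something like:
#     print(f'GPU Memory Used: {gpu_mem_used():,}')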
For my tests I am using the model and adapters described in the guide I wrote before: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147 . This test only tests very short prompts and completions. I tried a slightly larger prompt and completion further down in this comment, but perhaps you can test using your dataset as well.
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
export ADAPTER1_GGUF=./adapters/lora_tldr_headline_gen.gguf
export ADAPTER2_GGUF=./adapters/lora_tldr_content_gen.gguf
pip install psutil
python memtest.py 2>&1
memtest.py
import os
model_gguf = os.environ['MODEL_GGUF']
adapter1_gguf = os.environ['ADAPTER1_GGUF']
adapter2_gguf = os.environ['ADAPTER2_GGUF']
import psutil
process = psutil.Process()
prev_vms = 0
prev_rss = 0
def log_mem(msg):
    global prev_rss, prev_vms
    pmem = process.memory_info()
    vms = pmem.vms
    rss = pmem.rss
    delta_vms = vms - prev_vms
    delta_rss = rss - prev_rss
    print(f'====== {msg:<40} {vms:>16,} ({delta_vms:>+16,}) {rss:>16,} ({delta_rss:>+16,}) ======')
    prev_vms = vms
    prev_rss = rss
log_mem('initial')
import llama_cpp
log_mem('imported llama_cpp')
llm = llama_cpp.Llama(model_gguf)
log_mem('loaded model')
i = 0
for i in range(0, 100):
    # Create a pattern of enablement so we can see all patterns of enabled/disabled
    # as well as having sequences where no changes happen.
    desired_adapter1_scale = i // 2 % 2 * 1.0  # Enable 2 out of every 4 times
    desired_adapter2_scale = i // 4 % 2 * 1.0  # Enable 4 out of every 8 times
    # Check current state - note that we treat the initial state when they are not
    # loaded as 0.0 to ensure we have a couple of tests without them loaded
    lora_adapters = llm.lora_adapters or {}
    current_adapter1_scale = lora_adapters.get(adapter1_gguf, 0.0)
    current_adapter2_scale = lora_adapters.get(adapter2_gguf, 0.0)
    if current_adapter1_scale != desired_adapter1_scale:
        llm.set_lora_adapter_scale(adapter1_gguf, desired_adapter1_scale)
        log_mem(f'after set adapter 1 scale {desired_adapter1_scale}')
    if current_adapter2_scale != desired_adapter2_scale:
        llm.set_lora_adapter_scale(adapter2_gguf, desired_adapter2_scale)
        log_mem(f'after set adapter 2 scale {desired_adapter2_scale}')
    llm.create_completion(seed=12345, temperature=0, max_tokens=16, prompt=str(i))
    log_mem(f'after completion "{i}"')
When I run this I see initial allocations in virtual memory (first column), but it stays stable after the adapters have been loaded. The RAM usage stays the same across the various loads and unloads.
python memtest.py 2>/dev/null
====== initial 36,880,384 ( +36,880,384) 19,529,728 ( +19,529,728) ======
====== imported llama_cpp 314,400,768 ( +277,520,384) 42,971,136 ( +23,441,408) ======
====== loaded model 4,740,259,840 ( +4,425,859,072) 4,262,637,568 ( +4,219,666,432) ======
====== after completion "0" 4,774,264,832 ( +34,004,992) 4,263,817,216 ( +1,179,648) ======
====== after completion "1" 4,774,264,832 ( +0) 4,263,817,216 ( +0) ======
====== after set adapter 1 scale 1.0 4,789,526,528 ( +15,261,696) 4,279,021,568 ( +15,204,352) ======
====== after completion "2" 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after completion "3" 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after set adapter 1 scale 0.0 4,789,526,528 ( +0) 4,279,021,568 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,383,296 ( +13,856,768) 4,292,915,200 ( +13,893,632) ======
====== after completion "4" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "5" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "6" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "7" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "8" 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
...
Small RSS changes
Note that I am seeing a small (128 KB) occasional increase in resident memory usage (last column), which could be a different kind of leak, for example Python VM operations such as GC not reclaiming everything straight away. I don't think this is your memory leak, though, because I would expect llama.cpp-allocated memory to show up as an increase in virtual memory (first column), not just resident memory. Nonetheless, it's worth keeping an eye on; there's a small tracemalloc sketch after the excerpt below for checking whether the Python heap itself is growing.
...
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,292,915,200 ( +0) ======
====== after completion "10" 4,803,383,296 ( +0) 4,293,046,272 ( +131,072) ======
====== after completion "11" 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
...
====== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
====== after completion "72" 4,803,383,296 ( +0) 4,293,177,344 ( +131,072) ======
====== after completion "73" 4,803,383,296 ( +0) 4,293,177,344 ( +0) ======
...
====== after set adapter 2 scale 1.0 4,803,383,296 ( +0) 4,293,177,344 ( +0) ======
====== after completion "100" 4,803,383,296 ( +0) 4,293,308,416 ( +131,072) ======
====== after completion "101" 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
...
====== after completion "253" 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,383,296 ( +0) 4,293,308,416 ( +0) ======
...
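To help check whether the Python heap itself is growing (as opposed to native llama.cpp allocations), here is a small sketch using the standard library's tracemalloc. log_python_heap is a hypothetical helper meant to be called next to each log_mem call in memtest.py; tracemalloc only sees allocations made through Python's allocator, so growth here would point at the bindings' Python objects rather than llama.cpp:

# Sketch: distinguish Python-heap growth from native (llama.cpp) growth.
# tracemalloc only tracks allocations made through Python's allocator, so a
# leak here points at the bindings' Python objects rather than llama.cpp itself.
import tracemalloc

tracemalloc.start()
prev_current = 0

def log_python_heap(msg: str) -> None:
    global prev_current
    current, peak = tracemalloc.get_traced_memory()
    print(f'--- {msg:<40} python heap {current:>12,} ({current - prev_current:>+10,}) peak {peak:>12,}')
    prev_current = current

# Call log_python_heap(...) next to each log_mem(...) call in memtest.py.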
Testing larger prompt and completion size
The previous test only used short numeric prompts with a very small max token count. A slightly larger test might show a leak.
I patched the create_completion call to generate something a bit larger. This used more memory but didn't seem to leak either.
llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=str(i) + ' the quick brown fox jumped over the lazy dog who knows what will come next with a longer prompt')
====== initial 36,884,480 ( +36,884,480) 19,660,800 ( +19,660,800) ======
====== imported llama_cpp 314,400,768 ( +277,516,288) 42,971,136 ( +23,310,336) ======
====== loaded model 4,740,255,744 ( +4,425,854,976) 4,262,821,888 ( +4,219,850,752) ======
====== after completion "0" 4,774,723,584 ( +34,467,840) 4,267,278,336 ( +4,456,448) ======
====== after completion "1" 4,774,723,584 ( +0) 4,267,278,336 ( +0) ======
====== after set adapter 1 scale 1.0 4,789,907,456 ( +15,183,872) 4,282,482,688 ( +15,204,352) ======
====== after completion "2" 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after completion "3" 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after set adapter 1 scale 0.0 4,789,907,456 ( +0) 4,282,482,688 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,760,128 ( +13,852,672) 4,296,376,320 ( +13,893,632) ======
====== after completion "4" 4,803,760,128 ( +0) 4,296,507,392 ( +131,072) ======
====== after completion "5" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "6" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "7" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "8" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "9" 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,507,392 ( +0) ======
====== after completion "10" 4,803,760,128 ( +0) 4,296,769,536 ( +262,144) ======
====== after completion "11" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 2 scale 1.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "12" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "13" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 1.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "14" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after completion "15" 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 1 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
====== after set adapter 2 scale 0.0 4,803,760,128 ( +0) 4,296,769,536 ( +0) ======
Thanks Rich. I am still seeing a memory leak on the GPU. Will try a previous build without your changes and keep you posted.
Thanks for checking. To confirm, you ran the script I posted above?
If you are still seeing the leak, my theory is that there's a leak in the llama.cpp CUDA backend, which would explain why you're seeing it and I'm not with the CPU backend.
Currently I don't think the leak is in the Python bindings, because if it were, we should see it with both backends.
This is just my theory though. I would definitely want more info to confirm it - e.g. test different backends, try to replicate it in llama.cpp directly.
If you're able then, running the script above would be good. If you don't have a chance then I should be able to use a cloud server with a GPU to test. (I am investigating how to do that.)
Thanks a lot for your interest and for testing!
Have a decent repro now. I added the nvidia-smi output into your script. Also, the model being used is the model I trained. Below is the output snapshot when no adapter is used. GPU memory remains constant. All good.
====== after completion "1" 49,240,797,184 ( +532,480) 1,255,649,280 ( +659,456) ======
GPU Memory Used: [6729]
====== after completion "2" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "3" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "4" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "5" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "6" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "7" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "8" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "9" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "10" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "11" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "12" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "13" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "14" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "15"
Now below is the memory log when the adapter is set. GPU memory increases constantly.
====== after completion "1" 49,240,805,376 ( +532,480) 1,255,985,152 ( +598,016) ======
GPU Memory Used: [6729]
====== after completion "2" 49,362,423,808 ( +121,618,432) 1,313,374,208 ( +57,389,056) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,423,808 ( +0) 1,313,374,208 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,449,160,704 ( +86,736,896) 1,332,838,400 ( +19,464,192) ======
GPU Memory Used: [6811]
====== after completion "5" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6811]
====== after completion "6" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "7" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "8" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "9" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "10" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "11" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "12" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "13" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "14" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6825]
====== after completion "15" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
And if I set the adapter only once outside the loop, then there is no increase in GPU memory.
====== after completion "0" 49,362,345,984 ( +34,562,404,352) 1,307,557,888 ( +186,347,520) ======
GPU Memory Used: [6773]
====== after completion "1" 49,362,878,464 ( +532,480) 1,308,160,000 ( +602,112) ======
GPU Memory Used: [6773]
====== after completion "2" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "5" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "6" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "7" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "8" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "9" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "10" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "11" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "12" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "13" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "14" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "15" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
Interesting behavior. If I just set one adapter outside the loop, increase max_tokens to 256 and start inference, I see memory increase by about 8 MB per inference.
====== after completion "0" 49,363,697,664 ( +34,563,756,032) 1,309,958,144 ( +188,268,544) ======
GPU Memory Used: [6779]
====== after completion "1" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6787]
====== after completion "2" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6795]
====== after completion "3" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6803]
====== after completion "4" 49,363,832,832 ( +135,168) 1,310,285,824 ( +327,680) ======
GPU Memory Used: [6811]
====== after completion "5" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6819]
====== after completion "6" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6827]
====== after completion "7" 49,363,968,000 ( +135,168) 1,310,285,824 ( +0) ======
GPU Memory Used: [6835]
====== after completion "8" 49,363,968,000 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6843]
====== after completion "9" 49,397,522,432 ( +33,554,432) 1,310,474,240 ( +188,416) ======
GPU Memory Used: [6851]
====== after completion "10" 49,397,657,600 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6859]
====== after completion "11" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6867]
====== after completion "12" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6875]
====== after completion "13" 49,431,212,032 ( +33,554,432) 1,310,474,240 ( +0) ======
GPU Memory Used: [6883]
====== after completion "14" 49,431,347,200 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6891]
====== after completion "15" 49,431,347,200 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6899]
Thanks for confirming that. To summarise the info:
- When no adapter is used, GPU memory remains constant.
- When an adapter is set, GPU memory increases constantly. (Note: I assume this means the adapter is set inside the loop using the code I sent?)
- When the adapter is set only once outside the loop, there is no increase in GPU memory.
- When one adapter is set outside the loop and max_tokens is increased to 256, memory increases by about 8 MB per inference.
I may try to write the same loop code using the llama.cpp C++ library directly, to isolate any issues from the Python bindings in this PR. (You are welcome to have a go at writing the C++ if you wish, otherwise I will get to it this week.) I suspect an issue in the llama.cpp C++ layer, given the way it varies between backends, but we will need a good repro to isolate it and get help from the llama.cpp devs.
I will try to reproduce on GPU and maybe another backend like Vulkan, since CPU is not showing anything for me.
Another thing that might clarify when the memory is leaked would be to log these messages after any LoRA set-adapter calls. That will show memory allocated by the LoRA load operation (if any).
===== after set adapter 2 scale 0.0 4,803,383,296 ( +0) 4,293,046,272 ( +0) ======
Also perhaps we should log or vary the max_tokens since that seems relevant?
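For example, a hypothetical tweak to the memtest.py loop (reusing the llm and log_mem definitions from that script) that cycles max_tokens and records it, so we can see whether the growth tracks completion length:

# Hypothetical tweak to the memtest.py loop: cycle max_tokens and log it so we
# can see whether memory growth correlates with completion length.
max_tokens_cycle = [16, 64, 256]

for i in range(0, 100):
    max_tokens = max_tokens_cycle[i % len(max_tokens_cycle)]
    llm.create_completion(seed=12345, temperature=0, max_tokens=max_tokens, prompt=str(i))
    log_mem(f'after completion "{i}" (max_tokens={max_tokens})')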
All your statements above are true. I can summarize even further
- When no adapter is set, no memory increase
- When the adapter is set inside or outside the loop and max_tokens=16, memory increases, but at a small rate
- When the adapter is set inside or outside the loop and max_tokens=256, memory increases by about 8 MB for each inference
Can you share how to run the llama.cpp command line? I can run it on a GPU I have access to.
Good idea to try the llama.cpp command line.
The compiled llama.cpp for the Python bindings is in the vendor subdirectory.
There is a normal llama.cpp cli but I'm not sure if it supports running multiple completions in a single session.
Perhaps you can try running the server and then calling it multiple times with curl or via the ui?
It's in the examples/server subdirectory.
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
You can load a LoRA with --lora or --lora-scaled. It should be possible to set the seed/max tokens etc to match the test case.
Hi @hrsmanian , here is a Bash script to test against llama.cpp.
First, compile the llama-server binary. This should be in the llama-cpp-python source directory.
cd vendor/llama.cpp/
make llama-server
Then run the below script, llama-server-memtest.sh.
#!/bin/bash

# Function to clean up server process
cleanup() {
    local exit_code=$?
    echo "Cleaning up..."
    if [ ! -z "$SERVER_PID" ]; then
        kill $SERVER_PID 2>/dev/null
        wait $SERVER_PID 2>/dev/null
    fi
    exit $exit_code
}

# Set up trap for script exit
trap cleanup EXIT

# Start llama-server in background
./llama-server \
    --model "$MODEL_GGUF" \
    --lora "$ADAPTER1_GGUF" &

# Save server PID
SERVER_PID=$!

# Wait for server to start up
sleep 5

# Function to log memory usage
log_memory() {
    local msg=$1
    # Get virtual and resident memory in bytes
    local mem=$(ps -o vsz=,rss= -p $SERVER_PID)
    local vsz=$(echo $mem | cut -d' ' -f1)
    local rss=$(echo $mem | cut -d' ' -f2)
    # Convert to bytes (ps shows KB)
    vsz=$((vsz * 1024))
    rss=$((rss * 1024))
    # Calculate deltas
    if [ -z "$PREV_VSZ" ]; then
        PREV_VSZ=$vsz
        PREV_RSS=$rss
    fi
    local delta_vsz=$((vsz - PREV_VSZ))
    local delta_rss=$((rss - PREV_RSS))
    # Format with commas for readability
    printf "====== %-40s %'16d (%+'16d) %'16d (%+'16d) ======\n" \
        "$msg" $vsz $delta_vsz $rss $delta_rss
    PREV_VSZ=$vsz
    PREV_RSS=$rss
}

# Log initial memory state
log_memory "initial"

# Run completions in a loop
for i in {1..100}; do
    curl --silent --request POST \
        --url http://127.0.0.1:8080/completion \
        --header "Content-Type: application/json" \
        --data "{\"seed\":12345,\"max_tokens\":16,\"temperature\":0,\"prompt\": \"$i\"}" \
        > /dev/null
    log_memory "after completion \"$i\""
done
When I run it I get output like:
$ ./llama-server-memtest.sh 2>&1 | tee server-memtest.log
...run for awhile...
^C <interrupt>
$ cat server-memtest.log | grep ===
====== initial 11,023,482,880 ( +0) 8,471,502,848 ( +0) ======
====== after completion "1" 11,023,482,880 ( +0) 8,471,502,848 ( +0) ======
====== after completion "2" 11,090,591,744 ( +67,108,864) 8,471,764,992 ( +262,144) ======
====== after completion "3" 11,157,700,608 ( +67,108,864) 8,471,896,064 ( +131,072) ======
====== after completion "4" 11,224,809,472 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "5" 11,291,918,336 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "6" 11,291,918,336 ( +0) 8,471,896,064 ( +0) ======
====== after completion "7" 11,291,918,336 ( +0) 8,471,896,064 ( +0) ======
====== after completion "8" 11,359,027,200 ( +67,108,864) 8,471,896,064 ( +0) ======
====== after completion "9" 11,359,027,200 ( +0) 8,471,896,064 ( +0) ======
====== after completion "10" 11,359,027,200 ( +0) 8,471,896,064 ( +0) ======
====== after completion "11" 11,359,027,200 ( +0) 8,472,027,136 ( +131,072) ======
====== after completion "12" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "13" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "14" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
...
====== after completion "28" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "29" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
====== after completion "30" 11,359,027,200 ( +0) 8,472,027,136 ( +0) ======
There is some memory growth, but it stabilises after a while. The server might allocate IO buffers, perhaps it's doing caching, etc. It probably needs more analysis to know whether there is a real leak. I thought I'd share the script so you can look at GPU memory usage with it. For a really pure reproduction we may need to write C++ code that uses the plain llama.cpp API, but testing with the llama-server app first is a good start.
Any progress on this? This would be a really helpful feature.
Can LoRA hotswapping be more effective than reloading the model with a LoRA adapter? In my test, reloading the model is enough for me. 🤔