llama-cpp-python
Getting seg faults intermittently prior to streaming generation
- Physical (or virtual) hardware you are using, e.g. for Linux:
user@dev0:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: -
Model: Rockchip RK3588S
Thread(s) per core: 0
Core(s) per socket: 0
Socket(s): 0
Stepping: 0x2
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
- Operating System, e.g. for Linux:
user@dev0:~$ uname -a
Linux rock-5c 6.1.43-20-rk2312 #3e26818dc SMP Tue Nov 19 07:21:24 UTC 2024 aarch64 GNU/Linux
- SDK version, e.g. for Linux:
user@dev0:~$ python3 --version
Python 3.11.2
user@dev0:~$ make --version
GNU Make 4.3
Built for aarch64-unknown-linux-gnu
user@dev0:~$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0
Failure Information (for bugs)
- In my app I stream user input to the model and call generation with max_tokens=1 to load the prompt into the cache (I use this instead of Llama.eval because the input can be edited, and eval is strictly additive as far as I understand). Once the user finishes their input, the actual inference runs. Interestingly, the seg faults only happen during the full generation, never during the max_tokens=1 caching calls. Below is output where the user edited their input and then, after submitting it for full generation, a seg fault occurred; a short sketch of this priming pattern follows the log. The issue is intermittent, happening roughly every 10-20 generations.
Llama.generate: 6573 prefix-match hit, remaining 15 prompt tokens to eval
llama_perf_context_print: load time = 2135.45 ms
llama_perf_context_print: prompt eval time = 1698.20 ms / 15 tokens ( 113.21 ms per token, 8.83 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 1714.09 ms / 16 tokens
Llama.generate: 6572 prefix-match hit, remaining 22 prompt tokens to eval
Llama.generate: 6572 prefix-match hit, remaining 38 prompt tokens to eval
Segmentation fault
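For context, this is roughly the priming pattern described above (prime_cache and current_input are illustrative placeholder names, not my real code): every time the user edits the text, the edited prompt is resubmitted with max_tokens=1 so that llama-cpp-python's prefix matching keeps the KV cache in sync with the current prompt.

    # Illustrative sketch of the cache-priming pattern, assuming `llm` is the
    # Llama instance from the instantiation step below; names are placeholders.
    def prime_cache(llm, current_input):
        # max_tokens=1 forces a prompt eval so the edited input is loaded into
        # the KV cache via the prefix-match path shown in the log above.
        for _ in llm(current_input, max_tokens=1, stream=True, temperature=1):
            pass  # the single generated token is discarded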
Steps to Reproduce
- Instantiate model
self.llm = Llama(
    model_path=model_path,  # Llama 3.2 3B Q4_0
    seed=1,
    n_threads=4,
    n_ctx=10000,
    temperature=1,
    verbose=True,
)
- Generate with max_tokens=1 until user is finished inputting
for output in self.llm(
    my_context,
    max_tokens=1,
    stream=True,
    temperature=1,
):
    log.debug("generated single token to cache input")
- Attempt full generation, segmentation fault occurs before output is generated
def output_generator():
    log.info("Starting output generation : def output_generator()")
    for output in self.llm(
        my_context,
        seed=random.randint(1, 1000000),
        stream=True,
        max_tokens=650,
        temperature=1,
        stop=stop_generation_strings,
    ):
        token = output['choices'][0]['text']
        yield token
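Since the crash only shows up every 10-20 generations, here is a self-contained sketch that approximates the flow above in a loop. The model path, prompt text, and simulated edits are placeholders, not my production values.

    # Minimal standalone repro sketch; paths and prompts are placeholders.
    import random
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.2-3b-q4_0.gguf",  # placeholder path
        seed=1,
        n_threads=4,
        n_ctx=10000,
        verbose=True,
    )

    prompt = "Some long conversation context..."  # placeholder

    for i in range(50):  # crash typically appears within 10-20 iterations
        # Simulate the user editing their input: prime the cache with max_tokens=1.
        edited = prompt + f" user edit {i}"
        for _ in llm(edited, max_tokens=1, stream=True, temperature=1):
            pass

        # Full streaming generation; this is where the seg fault occurs.
        for output in llm(
            edited,
            seed=random.randint(1, 1000000),
            stream=True,
            max_tokens=650,
            temperature=1,
        ):
            print(output['choices'][0]['text'], end="", flush=True)
        print()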
Any ideas for potential workarounds or solutions are welcome. Thanks!