llama-cpp-python
Segmentation fault when converting embeddings into tensor
Trying to convert an embedding into a tensor leads to a segmentation fault:
System Info
- Physical (or virtual) hardware you are using, e.g. for Linux:
> sysctl -a | grep machdep.cpu
machdep.cpu.cores_per_package: 10
machdep.cpu.core_count: 10
machdep.cpu.logical_per_package: 10
machdep.cpu.thread_count: 10
machdep.cpu.brand_string: Apple M2 Pro
- Operating System, e.g. for Linux:
macOS Sequoia 15.3.1 (24D70)
- SDK version, e.g. for Linux:
Python 3.13.2
GNU Make 3.81
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
code
import logging

import torch
from llama_cpp import Llama
from rich.console import Console

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
console = Console(width=120)

embpath = "all-MiniLM-L6-v2-ggml-model-f16.gguf"
embedModel = Llama(model_path=embpath, embedding=True, verbose=True)

# test embedding model
query = ["Test sentence"]
try:
    embeds = embedModel.embed(input=query)
    print(embeds)
    genAns_tensor = torch.tensor(embeds)  # <- segfaults here
    del embedModel
except Exception as e:
    print("Embedding error:", e)
The code works if I only create the embeddings (i.e. if I remove the tensor conversion and just print the embeddings).
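For what it's worth, a workaround I would try (a sketch, not verified against this exact setup): copy the embeddings into an independent NumPy float32 array before involving torch, then build the tensor from that copy with `torch.from_numpy`. The idea is to rule out any interaction between torch's constructor and the backend buffers still owned by the `Llama` object. `snapshot_embeddings` and `fake_embeds` are illustrative names, not part of the llama-cpp-python API.

```python
import numpy as np

def snapshot_embeddings(embeds):
    """Deep-copy a list of embedding vectors into an independent float32 array."""
    return np.array(embeds, dtype=np.float32, copy=True)

# Stand-in for the list of vectors returned by embedModel.embed();
# in the real script this would be the `embeds` variable above.
fake_embeds = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
arr = snapshot_embeddings(fake_embeds)
print(arr.shape)  # (2, 3)
```

In the real script this would replace the direct `torch.tensor(embeds)` call with `torch.from_numpy(snapshot_embeddings(embeds))`, ideally after `del embedModel`, so torch only ever sees memory it owns.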
logs
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 6, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: layer 1: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: layer 2: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: layer 3: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: layer 4: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: layer 5: n_embd_k_gqa = 384, n_embd_v_gqa = 384
llama_kv_cache_init: Metal KV buffer size = 4.50 MiB
llama_init_from_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
llama_init_from_model: CPU output buffer size = 0.00 MiB
llama_init_from_model: Metal compute buffer size = 17.00 MiB
llama_init_from_model: CPU compute buffer size = 3.50 MiB
llama_init_from_model: graph nodes = 221
llama_init_from_model: graph splits = 2
Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | MATMUL_INT8 = 1 | ACCELERATE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Model metadata: {'tokenizer.ggml.cls_token_id': '101', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.seperator_token_id': '102', 'tokenizer.ggml.unknown_token_id': '100', 'tokenizer.ggml.token_type_count': '2', 'general.file_type': '1', 'tokenizer.ggml.eos_token_id': '102', 'bert.context_length': '512', 'bert.pooling_type': '1', 'tokenizer.ggml.bos_token_id': '101', 'bert.attention.head_count': '12', 'bert.feed_forward_length': '1536', 'tokenizer.ggml.mask_token_id': '103', 'tokenizer.ggml.model': 'bert', 'bert.attention.causal': 'false', 'general.name': 'all-MiniLM-L6-v2', 'bert.block_count': '6', 'bert.attention.layer_norm_epsilon': '0.000000', 'bert.embedding_length': '384', 'general.architecture': 'bert'}
Using fallback chat format: llama-2
Fatal Python error: Segmentation fault
Thread 0x0000000204908840 (most recent call first):
File "/Users/devashishraj/Desktop/localRAG/lrag/lib/python3.13/site-packages/llama_cpp/_internals.py", line 306 in decode
File "/Users/devashishraj/Desktop/localRAG/lrag/lib/python3.13/site-packages/llama_cpp/llama.py", line
[1]    63839 segmentation fault  PYTHONFAULTHANDLER=1 python3 -X dev embeddingTest.py
venv package list
pip list
Package Version
------------------------ -----------
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.2
annotated-types 0.7.0
anyio 4.7.0
attrs 24.3.0
beautifulsoup4 4.12.3
certifi 2024.12.14
charset-normalizer 3.4.0
dataclasses-json 0.6.7
diskcache 5.6.3
faiss-cpu 1.9.0.post1
filelock 3.17.0
frozenlist 1.5.0
fsspec 2025.2.0
gpt4all 2.8.2
h11 0.14.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
huggingface-hub 0.28.1
idna 3.10
Jinja2 3.1.5
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.3.12
langchain-community 0.3.12
langchain-core 0.3.33
langchain-ollama 0.2.3
langchain-text-splitters 0.3.3
langsmith 0.2.3
llama_cpp_python 0.3.7
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.23.1
mdurl 0.1.2
mpmath 1.3.0
multidict 6.1.0
mypy-extensions 1.0.0
networkx 3.4.2
numpy 2.2.3
ollama 0.4.7
orjson 3.10.12
packaging 24.2
pillow 11.1.0
pip 25.0.1
propcache 0.2.1
pydantic 2.10.3
pydantic_core 2.27.1
pydantic-settings 2.7.0
Pygments 2.18.0
PyMuPDF 1.25.1
python-dotenv 1.0.1
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
requests-toolbelt 1.0.0
rich 13.9.4
safetensors 0.5.2
scikit-learn 1.6.1
scipy 1.15.1
sentence-transformers 3.4.1
setuptools 75.8.0
sniffio 1.3.1
soupsieve 2.6
SQLAlchemy 2.0.36
sympy 1.13.1
tenacity 9.0.0
threadpoolctl 3.5.0
tiktoken 0.8.0
tokenizers 0.21.0
torch 2.6.0
tqdm 4.67.1
transformers 4.48.3
typing_extensions 4.12.2
typing-inspect 0.9.0
urllib3 2.2.3
yarl 1.18.3