gpt4all backend doesn't respect gpu_layers config
LocalAI version:
image tag: v2.10.1-cublas-cuda12-core
Environment, CPU architecture, OS, and Version:
uname: Linux worker002 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
kubernetes: v1.24.7
os: Ubuntu 22.04
k8s deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
  labels:
    app: localai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      nodeName: "worker002"
      runtimeClassName: nvidia
      containers:
        - name: localai
          image: images.xxx.com:30443/localai:v2.10.1-cublas-cuda12-core
          ports:
            - containerPort: 8080
          env:
            - name: DEBUG
              value: "true"
            - name: F16
              value: "true"
            - name: CONFIG_FILE
              value: "/build/configuration/config.yaml"
          resources:
            requests:
              cpu: 2
              memory: 12Gi
              gpu.xxx.com/GA102_GEFORCE_RTX_3080: 1
            limits:
              cpu: 2
              memory: 12Gi
              gpu.xxx.com/GA102_GEFORCE_RTX_3080: 1
          volumeMounts:
            - name: models
              mountPath: /build/models/
            - name: config
              mountPath: /build/configuration/
      volumes:
        - name: models
          hostPath:
            path: /zfspv-pool/localai/models/
            type: Directory
        - name: config
          hostPath:
            path: /zfspv-pool/localai/configuration/
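For reference, a quick way to confirm the pod actually sees the RTX 3080 (a diagnostic sketch; it assumes the deployment above runs in the default namespace and that the nvidia runtime class exposes the driver utilities inside the container):

# run nvidia-smi inside the running LocalAI pod
kubectl exec -it deploy/localai -- nvidia-smi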
Describe the bug
The config file:
- name: hello
  parameters:
    model: koala-7b-ggml-q4_0.bin
  context_size: 1024
  gpu_layers: 32
  f16: true
  cuda: true
  gpu_memory_utilization: 1
- name: lunna-ai
  parameters:
    model: luna-ai-llama2-uncensored.Q4_0.gguf
  context_size: 1024
  gpu_layers: 32
  f16: true
  cuda: true
  backend: llama
Send request:
root@master01:~/cxl# curl http://100.68.200.132/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "hello",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9,
"language":"en"
}'
When I use the luna-ai-llama2-uncensored.Q4_0.gguf model with the llama backend, gpu_layers works well: I can see GPU utilization with nvtop, and it only takes a few seconds to get a response.
When I use the koala-7b-ggml-q4_0.bin model, LocalAI automatically chooses gpt4all as the backend, and it seems to run inference on the CPU instead of the GPU (nvtop shows zero GPU utilization and the request hangs for a long time).
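For reference, the comparison request against the GGUF entry that pins backend: llama is identical apart from the model name:

curl http://100.68.200.132/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "lunna-ai",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.9,
  "language": "en"
}'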
Expected behavior
From the documentation, I could only find these GPU-related config items:
- f16
- gpu_layers
- gpu_memory_utilization
- cuda
I don't know the correct way to use the GPU with the various backends, because LocalAI just chooses backends automatically. What I expect is that gpu_layers works for all text-generation backends.
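For comparison, the only structural difference between the two config entries is the explicit backend: llama line. Below is a sketch of the koala entry with the backend pinned the same way; whether the older GGML (ggjt v1) file is still accepted by the llama backend in this image is an assumption I have not verified:

- name: hello
  backend: llama               # pin the backend instead of letting LocalAI auto-select gpt4all
  parameters:
    model: koala-7b-ggml-q4_0.bin
  context_size: 1024
  gpu_layers: 32               # same offload setting that works for the GGUF model
  f16: true
  cuda: true
  gpu_memory_utilization: 1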
Logs
The logs are as follows:
5:14PM INF [gpt4all] Attempting to load
5:14PM INF Loading model 'koala-7b-ggml-q4_0.bin' with backend gpt4all
5:14PM DBG Loading model in memory from file: /build/models/koala-7b-ggml-q4_0.bin
5:14PM DBG Loading Model koala-7b-ggml-q4_0.bin with gRPC (file: /build/models/koala-7b-ggml-q4_0.bin) (backend: gpt4all): {backendString:gpt4all model:koala-7b-ggml-q4_0.bin threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000154000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
5:14PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
5:14PM DBG GRPC Service for koala-7b-ggml-q4_0.bin will be running at: '127.0.0.1:33083'
5:14PM DBG GRPC Service state dir: /tmp/go-processmanager4223099298
5:14PM DBG GRPC Service Started
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr 2024/03/27 17:14:45 gRPC Server listening at 127.0.0.1:33083
5:14PM DBG GRPC Service Ready
5:14PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:koala-7b-ggml-q4_0.bin ContextSize:1024 Seed:880999533 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:32 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/koala-7b-ggml-q4_0.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:1 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama.cpp: loading model from /build/models/koala-7b-ggml-q4_0.bin
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: format = ggjt v1 (latest)
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_vocab = 32000
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_ctx = 2048
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_embd = 4096
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_mult = 256
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_head = 32
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_layer = 32
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_rot = 128
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: ftype = 2 (mostly Q4_0)
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_ff = 11008
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: n_parts = 1
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: model size = 7B
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: ggml ctx size = 59.11 KB
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_model_load_internal: mem required = 5809.32 MB (+ 1026.00 MB per state)
5:14PM DBG GRPC(koala-7b-ggml-q4_0.bin-127.0.0.1:33083): stderr llama_init_from_file: kv self size = 1024.00 MB
5:14PM INF [gpt4all] Loads OK
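Note that the gpt4all load above prints no CUDA or layer-offload lines at all, even though NGPULayers:32 is passed in the gRPC options. A quick way to filter the debug output for GPU-related lines while sending a request to each model (a diagnostic sketch, reusing the deployment name from above):

# show only GPU-related lines from the LocalAI debug logs
kubectl logs deploy/localai | grep -iE "cuda|offload|vram|ngpulayers"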
Additional context
Hello, I also had a similar problem. Did you solve it?
Not yet.
I apologize for the inconvenience. I'm currently investigating the issue you're experiencing with the koala-7b-ggml-q4_0.bin model not utilizing the GPU. I will update this thread as soon as I find a solution.
Any relevant information or alternative solutions would be appreciated.