Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
LocalAI version:
localai/localai:latest-aio-cpu
Environment, CPU architecture, OS, and Version:
cpu
Describe the bug
api_1 | 8:39AM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
api_1 | 8:39AM INF [llama-ggml] Attempting to load
api_1 | 8:39AM INF Loading model with backend llama-ggml
api_1 | 8:39AM DBG Loading model in memory from file: /build/models
To Reproduce
Expected behavior
Logs
forceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
api_1 | 8:39AM INF [llama-cpp] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
api_1 | 8:39AM INF [llama-ggml] Attempting to load
api_1 | 8:39AM INF Loading model with backend llama-ggml
api_1 | 8:39AM DBG Loading model in memory from file: /build/models
Additional context
Yup, same here regardless of install method.
same here
Same here
Did this happen with a specific model for you? For me it was Command R.
deepseek-r1-distill-llama-8b, using the localai/localai:latest-aio-gpu-nvidia-cuda-12 Docker image.
After downloading the 44GB image, I am still unable to get this to work.
5:20AM INF Trying to load the model 'deepseek-r1-distill-llama-8b' with the backend '[llama-cpp llama-ggml llama-cpp-fallback stablediffusion-ggml whisper bark-cpp piper stablediffusion silero-vad huggingface /build/backend/python/exllama2/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/transformers/run.sh /build/backend/python/bark/run.sh /build/backend/python/vllm/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/mamba/run.sh /build/backend/python/coqui/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/parler-tts/run.sh]'
5:20AM INF [llama-cpp] Attempting to load
5:20AM INF Loading model 'deepseek-r1-distill-llama-8b' with backend llama-cpp
WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "simple-framebuffer.0"
5:20AM ERR [llama-cpp] Failed loading model, trying with fallback 'llama-cpp-fallback', error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
5:20AM INF [llama-cpp] Fails: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
5:20AM INF [llama-ggml] Attempting to load
5:20AM INF Loading model 'deepseek-r1-distill-llama-8b' with backend llama-ggml
5:20AM INF [llama-ggml] Fails: failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = failed loading model
5:20AM INF [llama-cpp-fallback] Attempting to load
5:20AM INF Loading model 'deepseek-r1-distill-llama-8b' with backend llama-cpp-fallback
I notice in the logs: failed: out of memory, however the needed memory is available.
3:07AM DBG GRPC(intellect-1-instruct-127.0.0.1:37827): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1344.00 MiB on device 0: cudaMalloc failed: out of memory
3:07AM DBG GRPC(intellect-1-instruct-127.0.0.1:37827): stderr llama_kv_cache_init: failed to allocate buffer for kv cache
3:07AM DBG GRPC(intellect-1-instruct-127.0.0.1:37827): stderr llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
3:07AM DBG GRPC(intellect-1-instruct-127.0.0.1:37827): stderr common_init_from_params: failed to create context with model '/build/models/INTELLECT-1-Instruct-Q4_K_M.gguf'
3:07AM ERR [llama-cpp] Failed loading model, trying with fallback 'llama-cpp-fallback', error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
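For what it's worth, the KV cache alone needs that extra ~1.3 GiB on top of the weights, so the full 4096-token context may simply not fit. A sketch of a model config that trims GPU usage (field names as I understand the LocalAI model YAML; file name and values are guesses, not tested):

```yaml
# models/intellect-1-instruct.yaml (hypothetical file, adjust to your setup)
name: intellect-1-instruct
backend: llama-cpp
parameters:
  model: INTELLECT-1-Instruct-Q4_K_M.gguf
context_size: 2048   # smaller context -> smaller KV cache than the 4096 default
gpu_layers: 20       # offload only part of the model instead of NGPULayers:99999999
f16: true
```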
@mudler is this the model or the server?
LocalAI-functioncall-phi-4-v0.3
LocalAI Version v2.26.0
7:25AM INF [stablediffusion-ggml] Fails: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
7:25AM INF [whisper] Attempting to load
7:25AM INF BackendLoader starting backend=whisper modelID=LocalAI-functioncall-phi-4-v0.3 o.model=localai-functioncall-phi-4-v0.3-q4_k_m.gguf
7:25AM DBG Loading model in memory from file: /build/models/localai-functioncall-phi-4-v0.3-q4_k_m.gguf
7:25AM DBG Loading Model LocalAI-functioncall-phi-4-v0.3 with gRPC (file: /build/models/localai-functioncall-phi-4-v0.3-q4_k_m.gguf) (backend: whisper): {backendString:whisper model:localai-functioncall-phi-4-v0.3-q4_k_m.gguf modelID:LocalAI-functioncall-phi-4-v0.3 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0003a6008 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
7:25AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
7:25AM DBG GRPC Service for LocalAI-functioncall-phi-4-v0.3 will be running at: '127.0.0.1:37653'
7:25AM DBG GRPC Service state dir: /tmp/go-processmanager3335784722
7:25AM DBG GRPC Service Started
7:25AM DBG Wait for the service to start up
7:25AM DBG Options: ContextSize:4096 Seed:435165384 NBatch:512 F16Memory:true MMap:true NGPULayers:99999999 Threads:10
7:25AM DBG GRPC(LocalAI-functioncall-phi-4-v0.3-127.0.0.1:37653): stderr 2025/02/17 07:25:30 gRPC Server listening at 127.0.0.1:37653
7:25AM DBG GRPC Service Ready
7:25AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc00075d958} sizeCache:0 unknownFields:[] Model:localai-functioncall-phi-4-v0.3-q4_k_m.gguf ContextSize:4096 Seed:435165384 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/localai-functioncall-phi-4-v0.3-q4_k_m.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false ModelPath:/build/models LoraAdapters:[] LoraScales:[] Options:[] CacheTypeKey: CacheTypeValue: GrammarTriggers:[]}
7:25AM DBG GRPC(LocalAI-functioncall-phi-4-v0.3-127.0.0.1:37653): stderr whisper_init_from_file_with_params_no_state: loading model from '/build/models/localai-functioncall-phi-4-v0.3-q4_k_m.gguf'
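Part of the noise here is the loader cascading through unrelated backends (whisper, stablediffusion-ggml, ...) for an LLM GGUF. If I understand the model YAML correctly, pinning the backend avoids that cascade; a minimal sketch, untested and with field names assumed:

```yaml
# models/localai-functioncall-phi-4-v0.3.yaml (hypothetical path)
name: LocalAI-functioncall-phi-4-v0.3
backend: llama-cpp   # pin the backend so whisper/stablediffusion are never tried
parameters:
  model: localai-functioncall-phi-4-v0.3-q4_k_m.gguf
context_size: 4096
f16: true
```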
I got deepseek-r1-distill-llama-8b working by removing the /tmp mount.
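In case it helps, the relevant part of the compose file looks roughly like this (a sketch assuming the standard LocalAI docker-compose layout, not my exact file):

```yaml
# docker-compose.yaml (sketch)
services:
  api:
    image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    volumes:
      - ./models:/build/models
      # - ./tmp:/tmp   # removing this /tmp mount is what fixed the EOF for me
```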
I think LocalAI isn't unloading models when the user switches to a different one, since a restart makes the model work (for most models). We need an unload button and better error handling.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
That's lovely, arbitrary GitHub bot, but you didn't solve the issue.