Not able to reach the Ollama local model hosted on another machine
Issue Category
Undocumented feature or missing documentation
Affected Documentation Page URL
No response
Issue Description
I have the following config.json:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7b",
      "model": "qwen-2.5-coder-instruct-7b",
      "provider": "ollama",
      "apiBase": "http://192.168.120.243:9000/v1/chat/completions"
    }
  ],
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5b Autocomplete Model",
    "provider": "ollama",
    "model": "qwen-2.5-coder-instruct-7b",
    "apiBase": "http://192.168.120.243:9000/v1/"
  },
  "data": [],
  "docs": [
    {
      "startUrl": "https://requests.readthedocs.io",
      "title": "requests"
    }
  ]
}
I am not able to receive a response (error 404). Here are the Ollama server logs:
time=2025-06-09T14:52:06.479Z level=INFO source=types.go:130 msg="inference compute" id=GPU-1f0828c4-2144-92a0-a19b-ece2a193546a library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="2.7 GiB"
time=2025-06-09T14:52:06.479Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a4326111-a43d-9b34-3414-701320dafc95 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="9.6 GiB"
time=2025-06-09T14:52:06.479Z level=INFO source=types.go:130 msg="inference compute" id=GPU-5d59eeb5-c7c3-5bd7-6917-a1e7bac7a9fe library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="10.1 GiB"
time=2025-06-09T14:52:06.479Z level=INFO source=types.go:130 msg="inference compute" id=GPU-1ce4c642-299f-18b4-78b1-497671a6852b library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="9.2 GiB"
[GIN] 2025/06/09 - 15:01:40 | 404 | 808.436µs | 192.168.120.28 | POST "/api/show"
[GIN] 2025/06/09 - 15:01:40 | 404 | 340.388µs | 192.168.120.28 | POST "/api/show"
[GIN] 2025/06/09 - 15:01:45 | 404 | 985.237µs | 192.168.120.28 | POST "/api/chat"
[GIN] 2025/06/09 - 15:01:59 | 404 | 8.957µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:00 | 404 | 4.077µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:05 | 404 | 6.563µs | 192.168.120.28 | POST "/v1/api/chat"
[GIN] 2025/06/09 - 15:02:35 | 404 | 11.732µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:36 | 404 | 4.057µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:42 | 404 | 7.965µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:43 | 404 | 7.394µs | 192.168.120.28 | POST "/v1/api/show"
[GIN] 2025/06/09 - 15:02:51 | 404 | 7.253µs | 192.168.120.28 | POST "/v1/api/chat"
[GIN] 2025/06/09 - 15:03:23 | 404 | 8.096µs | 192.168.120.28 | POST "/v1/api/chat"
[GIN] 2025/06/09 - 15:03:44 | 404 | 7.905µs | 192.168.120.28 | POST "/v1/chat/api/show"
[GIN] 2025/06/09 - 15:03:44 | 404 | 4.258µs | 192.168.120.28 | POST "/v1/chat/api/show"
[GIN] 2025/06/09 - 15:03:47 | 404 | 8.836µs | 192.168.120.28 | POST "/v1/chat/api/chat"
[GIN] 2025/06/09 - 15:03:55 | 404 | 5.08µs | 192.168.120.28 | POST "/v1/chat/completions/api/show"
[GIN] 2025/06/09 - 15:03:56 | 404 | 4.328µs | 192.168.120.28 | POST "/v1/chat/completions/api/show"
[GIN] 2025/06/09 - 15:03:56 | 404 | 3.457µs | 192.168.120.28 | POST "/v1/chat/completions/api/chat"
...
I tried using /v1, /v1/ and /v1/chat/completions, but none of them worked. The extension is requesting the /api/chat and /api/show endpoints. There is nothing in the documentation mentioning how to handle this.
Expected Content
Please detail how to precisely specify the URL of a locally hosted model, with the proper endpoints. It would be even better to have an end-to-end config.json or config.yaml example. Thanks!
This only worked when using an offline model with llama.cpp. The config.json file became:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7b",
      "model": "qwen-2.5-coder-instruct-7b",
      "provider": "llama.cpp",
      "apiBase": "http://192.168.120.243:8078"
    }
  ],
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5b Autocomplete Model",
    "provider": "llama.cpp",
    "model": "qwen-2.5-coder-instruct-7b",
    "apiBase": "http://192.168.120.243:8078"
  },
  "data": [],
  "docs": [
    {
      "startUrl": "https://requests.readthedocs.io",
      "title": "requests"
    }
  ]
}
Weirdly enough, when the provider is ollama and the apiBase points to a running Ollama instance on its correct port, the extension tries to request /api/generate and /api/show, which do not appear to work against that server (404 not found). So I think this is a bug in the extension when the provider is set to ollama.
/api/generate is documented here for Ollama: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
but you're right that llama.cpp wouldn't support that. I think it probably just comes down to the apiBase. In the original config you shared, you should strip the /v1/... and make it just the host and port.
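Concretely, that would mean both apiBase values point at just the server root, something along these lines (assuming Ollama itself is what's listening on port 9000 of that machine):

"apiBase": "http://192.168.120.243:9000"

i.e. no /v1, /v1/chat/completions, or any other path segment, since your logs show the extension appending its own Ollama endpoints (/api/show, /api/chat, ...) to whatever base you give it.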
I tried it with and without /v1; same result. No, the issue is not with llama.cpp: the llama.cpp provider worked, but ollama didn't. I have Ollama installed on a server, and when pointing to the Ollama URL on that server it doesn't work.
Is there an update on this, or a workaround that someone has figured out?
Hi @msharara1998 ,
Could you confirm that apiBase is set to http://192.168.120.243:9000 in both models and tabAutocompleteModel?
Also, note that the correct model name string in Ollama appears to be qwen2.5-coder:7b-instruct, not qwen-2.5-coder-instruct-7b.
If you still encounter issues, it would be appreciated if you could share the logs from your IDE's Developer Tools / Logs.
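If it helps, a minimal sketch of the chat model entry with both of those suggestions applied would look roughly like this (assuming the model was pulled on the server under the tag qwen2.5-coder:7b-instruct):

"models": [
  {
    "title": "Qwen 2.5 Coder 7b",
    "model": "qwen2.5-coder:7b-instruct",
    "provider": "ollama",
    "apiBase": "http://192.168.120.243:9000"
  }
]

with the same model/apiBase pair repeated in tabAutocompleteModel.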
Repro is very easy: just set up Ollama on a second machine, then set that machine as the apiBase and it will not work. The log shows it's getting a 404; the reason is that it's tacking /api/ onto the URL, which isn't valid.
I've written a Go program to forward all network traffic on localhost:11434 to the remote machine, to pretend that Ollama is running locally; that works fine.
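For reference, a minimal sketch of such a forwarder looks roughly like the following; the remote address below is an assumption (default Ollama port on the remote box), so adjust it to wherever your Ollama server actually listens:

package main

import (
	"io"
	"log"
	"net"
)

const (
	listenAddr = "localhost:11434"       // where the extension expects a local Ollama
	remoteAddr = "192.168.120.243:11434" // assumed remote Ollama host:port, adjust as needed
)

// handle relays one client connection to the remote Ollama server in both directions.
func handle(client net.Conn) {
	defer client.Close()

	remote, err := net.Dial("tcp", remoteAddr)
	if err != nil {
		log.Printf("dial %s: %v", remoteAddr, err)
		return
	}
	defer remote.Close()

	// Copy bytes both ways until either side closes the connection.
	done := make(chan struct{}, 2)
	go func() { io.Copy(remote, client); done <- struct{}{} }()
	go func() { io.Copy(client, remote); done <- struct{}{} }()
	<-done
}

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatalf("listen %s: %v", listenAddr, err)
	}
	log.Printf("forwarding %s -> %s", listenAddr, remoteAddr)
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}

With this running, the extension can be left pointed at the default local Ollama address while the actual inference happens on the remote machine.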
This issue hasn't been updated in 90 days and will be closed after an additional 10 days without activity. If it's still important, please leave a comment and share any new information that would help us address the issue.
This issue was closed because it wasn't updated for 10 days after being marked stale. If it's still important, please reopen + comment and we'll gladly take another look!