When using gpt-oss-120B via vLLM locally, opencode doesn't call the tools
Description
When I try to use a local model (such as gpt-oss-120b or qwen3-32b), the response contains only thinking content: no tool calls (even when the thinking content says the model should call a tool) and no other output. Occasionally it works normally, but very rarely. This is very strange.
Plugins
No response
OpenCode version
1.1.4
Steps to reproduce
opencode.json:
vllm serve:
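(The actual opencode.json and vllm serve command were not captured in this report. For orientation only, a minimal setup pointing opencode at a local vLLM OpenAI-compatible endpoint usually looks roughly like the following; the provider id, base URL, port, and model name are assumptions, not the reporter's values:)

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "vLLM (local)",
      "options": { "baseURL": "http://localhost:8000/v1" },
      "models": {
        "openai/gpt-oss-120b": { "name": "gpt-oss-120b" }
      }
    }
  }
}

with the server started along the lines of:

vllm serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000 --max-model-len 40960 --enable-auto-tool-choice --tool-call-parser openai  # flags illustrative, not the reporter's actual command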
Screenshot and/or share link
No response
Operating System
No response
Terminal
No response
This issue might be a duplicate of existing issues. Please check:
- #7083: Using local Ollama models doesnt return any results - same symptoms with local models returning only JSON tool calls instead of actual responses
- #6649: Weird error with toolcalling on local model - similar tool calling issues with local models via Llama.cpp
- #5694: Local Ollama models are not agentic - local models not able to make tool calls properly
- #6223: Model not returning the answers - Qwen2.5-coder with Ollama not responding, only showing JSON tool calls
- #4428: Why is opencode not working with local llms via Ollama - widespread issue with local LLMs and tool calling
- #4255: OpenCode v1.0.25 Hangs Indefinitely with LM Studio + Qwen Models - detailed analysis of empty tool_calls array handling issue with local models
- #234: Tool Calling Issues with Open Source Models - discusses case sensitivity and tool calling inconsistencies with open source models
All of these report similar symptoms: local models (gpt-oss, qwen variants) via vLLM/Ollama either failing to call tools, only returning thinking content, or hanging indefinitely.
Feel free to ignore if none of these address your specific case.
I noticed there is no mention of vLLM in the official docs. Is vLLM not well supported?
I am running into a similar issue, as described here: https://github.com/anomalyco/opencode/issues/7083 and it seems that local LLMs from Ollama are not capable of code-assistant tasks in opencode.
What hardware are you using, btw?
An A100, Linux. I'm not sure whether this is a problem with the model or with the startup parameters, or whether it needs other plugins.
I am using 'GPT-OSS' on Ollama on my local machine. I have Ollama set up on Ubuntu 22 LTS with an Nvidia 3090, and it is already wired up with 'OpenWebUI'. I started using 'opencode' yesterday, and 'opencode' is really a big step forward in AI from my perspective. I was hooked by the agentic work that Network Chuck showed on his channel: https://www.youtube.com/watch?v=MsQACpcuTkU It would be pretty fantastic if this could work. I was really fascinated when I saw in the above video how much better an experience can be achieved, especially if I can put agentic work to use on the '.md' files that I have saved on 'NextCloud'.
And I would really like it if this also worked with local Ollama models like gpt-oss. Ollama shows me that tools are also supported with 'Mistral', which I also could not get to work.
Are there any more detailed tutorials on how to get this working? Is this even an 'opencode' issue, or is it a model issue? I would be very happy if someone could point me in a direction for getting this to work. Hats off to the creators of 'opencode'; this is really fantastic.
If you are using vLLM: I thought there was a long-open issue about tool calling not working through vLLM with gpt-oss, open for months.
That is bad news. Is there a link to this issue? And is there a plan to address it?
Had the same issue. It was a problem with the context window. Ollama uses a default context window of 2048 tokens, which isn't enough to fit all the tool descriptions and example schemas. Create a Modelfile for gpt-oss:120b like this:
FROM gpt-oss:120b
PARAMETER num_ctx 16384
PARAMETER temperature 1
then run:
ollama create gpt-oss:120b-16384 -f Modelfile
This should ensure that the model has sufficient context to consider all tool definitions. If this doesn't work, increase to 32k
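As a quick sanity check (assuming a reasonably recent Ollama release), you can inspect the new tag to confirm the larger context took effect, and then select that tag as the model in opencode:

ollama show gpt-oss:120b-16384
# the Parameters section should list: num_ctx 16384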
Thank you, but I am not using Ollama, and when I started the model with vLLM I set max-model-len to 40960.
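One way to tell whether the problem sits in opencode or in the vLLM server is to send a tool-enabled request straight at the OpenAI-compatible endpoint and see whether a tool_calls array comes back. A rough sketch, assuming the server is on localhost:8000 and serving openai/gpt-oss-120b (adjust to your setup); the get_weather tool is just a made-up example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

If the response only contains reasoning text and no tool_calls, the server side is the likely culprit: vLLM generally needs --enable-auto-tool-choice plus a matching --tool-call-parser for tool calls to be emitted at all, which lines up with the compose file shared further down.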
Hi, @rekram1-node. Can using sglang solve this problem?
I am also having this issue, and I have 128k context on a DGX Spark.
My vLLM compose file:
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.11-py3
    restart: always
    container_name: vllm-gpt-oss-120b
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      VLLM_USE_NATIVE_FP4: "1"
      VLLM_FP4_IMPL: "nvfp4"
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    volumes:
      - ./model-cache:/root/.cache/huggingface
      - ./vllm-cache:/root/.cache/vllm
    command: >
      vllm serve openai/gpt-oss-120b
      --host 0.0.0.0
      --port 8000
      --quantization mxfp4
      --trust-remote-code
      --gpu-memory-utilization 0.90
      --max-model-len 131072
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser openai
I tried the setup above, and it seems to be effective.
@inv1s10n is it working well? What vLLM version are you using?
I've had no issues with it recently. My vLLM version is 0.14.0rc1.
Thank you!!!!!!! The Modelfile/num_ctx fix above worked for me (I am using Ollama).
I think the ideal solution here is doing a pass on Ollama so that we automatically send the num_ctx param so you don't have to. I need to look into it, but last time you weren't able to do that via the OpenAI-compatible endpoints... If this has changed it should be an easy fix.
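For reference, Ollama's native /api/chat endpoint does accept a per-request context size via options.num_ctx; whether the same now works through the OpenAI-compatible /v1 route that opencode talks to is exactly the open question above. A rough sketch of the native-API form, assuming a default Ollama install on localhost:11434:

curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"num_ctx": 16384}
}'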