
How to use this model on Windows with structured JSON output and persistent system instructions (like OpenAI Playground)

Open · Yash-Y09 opened this issue 8 months ago · 1 comment

Hi team,

I'm trying to use this model locally on Windows with minimal system dependencies, ideally via llama.cpp. In OpenAI Playground, we can set a System Instruction and the model provides consistent structured JSON outputs based on user inputs, while also retaining short-term conversational context. I want to replicate this functionality locally.

What I'm trying to achieve: I have a long system instruction (~78 lines) describing how to interpret user commands and convert them into a fixed structured JSON.

Example commands from the user:

"Turn on the task light" "Switch the light to green" "Set the desk height to 55"

The model should output a consistent structured JSON like:

json { "task": "RGB Color Control", "action": "set_color", "value": "green" }

What I’ve tried: I added the full system instruction into the prompt template in llama.cpp. But the model often responds with just a plain message like "ok light turned on" instead of the required JSON. Sometimes it even echoes the entire prompt/instruction back in the response, which is not expected.

Questions:

- What’s the best practice to persist such a long system instruction across user prompts?
- How can I force the model to respond only with a specific JSON schema every time?
- Is there a way in llama.cpp to maintain short-term memory (like a few previous turns) to help with conversational context?

Any advice or example configurations to achieve this behavior with local inference (especially via llama.cpp) would be greatly appreciated.

Thanks!

Yash-Y09 · Jun 09 '25

1. Persisting a Long System Instruction

llama.cpp doesn’t have a built-in “system instruction” feature the way the OpenAI API does. You have to manually include your long system instruction in every prompt you send, which can get super long and repetitive!

Tips:

Save your system instruction as a separate text block.

Before sending each user message, concatenate the system instruction + recent conversation history + user input into one prompt (see the sketch after this list).

Keep the prompt under your model’s max token limit (or it’ll get cut off).
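
A minimal sketch of that assembly loop, assuming the llama-cpp-python bindings (its `Llama` class, `tokenize`, and completion call); the model path, file names, and token budgets below are placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path and context size -- adjust for your setup.
CTX = 4096
llm = Llama(model_path="bitnet-model.gguf", n_ctx=CTX)

SYSTEM_INSTRUCTION = open("system_instruction.txt").read()  # your ~78 lines
history = []  # list of (user_text, assistant_text) turns

def build_prompt(user_input: str) -> str:
    # System instruction first, then recent turns, then the new command.
    turns = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history)
    return f"{SYSTEM_INSTRUCTION}\n\n{turns}User: {user_input}\nAssistant: "

def ask(user_input: str) -> str:
    prompt = build_prompt(user_input)
    # Drop the oldest turns until the prompt fits, leaving room for the reply.
    while history and len(llm.tokenize(prompt.encode("utf-8"))) > CTX - 256:
        history.pop(0)
        prompt = build_prompt(user_input)
    out = llm(prompt, max_tokens=256, stop=["User:"])
    reply = out["choices"][0]["text"].strip()
    history.append((user_input, reply))
    return reply

print(ask("Turn on the task light"))
```

Trimming the oldest turns first keeps the system instruction and the most recent context intact. If you drive llama.cpp interactively from the CLI instead, the `--keep N` flag serves a similar purpose: it pins the first N prompt tokens (your system instruction) when the context window fills up.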

2. Forcing the Model to Output Only JSON

This is tricky! Models are kinda free-spirited, but you can push them with careful prompt engineering:

Start your prompt with something like: "You are a JSON-only response bot. Respond only with valid JSON, no extra text."

Provide a few clear examples of user commands + the exact JSON you want (few-shot learning).

End your prompt with a user command and an explicit request for JSON only, e.g., "Respond with JSON:"

Sometimes adding an instruction like "Do not add explanations or greetings; only return the JSON object." helps.

Putting it all together, a few-shot prompt might look like this:

```
[System Instruction - your 78 lines here]

User: Turn on the task light
Assistant: { "task": "Light Control", "action": "turn_on", "value": "task light" }

User: Switch the light to green
Assistant: { "task": "RGB Color Control", "action": "set_color", "value": "green" }

User: Set the desk height to 55
Assistant: { "task": "Desk Adjustment", "action": "set_height", "value": 55 }

User: [NEW USER COMMAND HERE]
Assistant:
```
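
Beyond prompt engineering, llama.cpp supports grammar-constrained sampling (GBNF), which rules out non-JSON output at the decoding level rather than just asking nicely. A minimal sketch using the llama-cpp-python bindings and the general-purpose JSON grammar that ships with llama.cpp (`grammars/json.gbnf`); the model path is a placeholder, and `build_prompt` is the helper from the earlier sketch:

```python
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="bitnet-model.gguf", n_ctx=4096)  # placeholder path

# Load llama.cpp's general JSON grammar; with it attached, the sampler
# can only pick tokens that keep the output valid JSON, so replies like
# "ok light turned on" become impossible.
grammar = LlamaGrammar.from_file("grammars/json.gbnf")

prompt = build_prompt("Switch the light to green")  # few-shot prompt from above
out = llm(prompt, max_tokens=128, grammar=grammar, stop=["User:"])
print(out["choices"][0]["text"])
# expected shape: { "task": "RGB Color Control", "action": "set_color", "value": "green" }
```

The same mechanism is available from the llama.cpp CLI via `--grammar-file grammars/json.gbnf`, and if you need the exact field names enforced (not just any valid JSON), you can write a stricter GBNF grammar for your schema.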

xennon-sudo · Jul 17 '25