webui: Add a "Continue" Action for Assistant Message

allozaur opened this pull request 3 months ago • 20 comments

Closes #16097

Add Continue and Save features for chat messages

What's new

Continue button for assistant messages

  • Click the arrow button on any assistant response to keep generating from where it left off
  • Useful for getting longer outputs or continuing after you've edited a response
  • New content gets appended to the existing message

Save button when editing user messages

  • Now you get three options when editing: Cancel, Save, and Send
  • Save keeps your edit without regenerating the response (preserves the conversation below)
  • Send saves and regenerates like before
  • Useful when you just want to fix a typo without losing the assistant's response

Technical notes

  • Added continueAssistantMessage() and editUserMessagePreserveResponses() methods to ChatStore
  • Continue feature sends a synthetic "continue" prompt to the API (not saved to the database); see the sketch after this list
  • Assistant message edits now preserve trailing whitespace for proper continuation
  • Follows existing component architecture patterns
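
The flow behind that synthetic prompt could look roughly like this. This is a sketch only: continueAssistantMessage() is the real method name from the notes above, but its body and the getMessagesUpTo/appendStreamToMessage helpers are assumptions for illustration.

```ts
// Sketch: the synthetic "continue" turn is sent to the API but never
// persisted, and the streamed tokens are appended to the existing message.
async function continueAssistantMessage(messageId: string): Promise<void> {
  const history = getMessagesUpTo(messageId); // hypothetical helper

  const response = await fetch('/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      // Synthetic user turn asking the model to keep going; it is not
      // saved to the conversation database.
      messages: [...history, { role: 'user', content: 'continue' }],
      stream: true
    })
  });

  // Append the streamed completion to the existing assistant message.
  await appendStreamToMessage(messageId, response.body); // hypothetical helper
}
```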

Demos

ggml-org/gpt-oss-20b-GGUF

https://github.com/user-attachments/assets/c4040464-4116-4239-b3b9-f5ec109bc3d9

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

https://github.com/user-attachments/assets/5f1d1856-8353-4ab6-af47-87caaef2dd17

allozaur avatar Nov 03 '25 15:11 allozaur

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

ggerganov avatar Nov 03 '25 15:11 ggerganov

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

I've tested it for the edited assistant responses so far. I will take a close look at the stopped generation -> continue flow as well.

allozaur avatar Nov 03 '25 15:11 allozaur

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

When using gpt-oss in LM Studio, the model generates a new response instead of continuing the previous text. This is because of the Harmony parser; uninstalling it resolves the problem and the model continues the generation successfully.

Iq1pl avatar Nov 05 '25 05:11 Iq1pl

@ggerganov please check the demos I've attached to the PR description and also test this feature on your end. Looking forward to your feedback!

allozaur avatar Nov 12 '25 17:11 allozaur

Continue feature sends a synthetic "continue" prompt to the API (not saved to the database)

Hm, I wonder why it's done like this. We already have support on the server to continue the assistant message if it is the last one in the request (#13174):

https://github.com/ggml-org/llama.cpp/blob/c7e23c79cf4316c385cee0a9d53b688b0d66686f/tools/server/utils.hpp#L729-L751

The current approach often does not continue properly, as can be seen in the sample videos:

[screenshot: example of the continuation failing]

Using the assistant prefill functionality above would make this work correctly in all cases.
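
For reference, a request exercising that prefill path might look roughly like this (a sketch; the message contents are illustrative):

```ts
// When the last message in the list has role "assistant", the server
// (since #13174) renders the template without a new generation prompt
// and continues generating from the partial assistant text.
const res = await fetch('/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [
      { role: 'user', content: 'Write a haiku about autumn.' },
      // Partial response to continue from; preserving trailing whitespace
      // matters for a seamless continuation.
      { role: 'assistant', content: 'Crimson leaves drifting down the ' }
    ],
    stream: true
  })
});
```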

ggerganov avatar Nov 13 '25 09:11 ggerganov

Agree with @ggerganov , it's better to use the prefill assistant message from https://github.com/ggml-org/llama.cpp/pull/13174

Just one thing to note though: I think most templates do not support formatting the reasoning content back into the original text, so that's probably the only case where it will break.

ngxson avatar Nov 13 '25 10:11 ngxson

Thanks guys, I missed that! Will patch it and come back to you.

allozaur avatar Nov 13 '25 10:11 allozaur

@ggerganov @ngxson

I've updated the logic with 859e496 and tested with a few models; only one (Qwen3-VL-32B-Instruct-GGUF) managed to properly continue the assistant message in response to the prefill request. See videos below.

Qwen3-VL-32B-Instruct-GGUF

https://github.com/user-attachments/assets/18f79deb-07e4-4172-9102-52df8f582c50

ggml-org/gpt-oss-20b-gguf

https://github.com/user-attachments/assets/f51f9d02-4444-42c6-9c77-c18aaeab9fd0

ggml-org/gpt-oss-120b-gguf

https://github.com/user-attachments/assets/6012222b-b51d-4e5d-adfc-fb399d151cd8

unsloth/gemma3-12b-it-gguf

https://github.com/user-attachments/assets/0f5b6794-0e22-426c-b000-0e4d202485f2

allozaur avatar Nov 13 '25 19:11 allozaur

For me, both Qwen3 and Gemma3 are able to complete successfully. For example, here is Gemma3 12B IT:

https://github.com/user-attachments/assets/fb83fe10-50fe-449f-89bc-4a6b8db87604

It's strange that it didn't work for you.

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

ggerganov avatar Nov 13 '25 19:11 ggerganov

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

Should we then address the thinking models differently for now, at least from the WebUI perspective?

It's strange that it didn't work for you.

I will do some more testing with other instruct models and make sure all is working right.

allozaur avatar Nov 13 '25 20:11 allozaur

It's likely due to the chat template. I suspect some chat templates (especially jinja ones) add the generation prompt. Can you verify what the chat template looks like with the POST /apply-template endpoint? (The request body is the same as /chat/completions.)
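
A quick check could look like this (a sketch, assuming the endpoint returns the rendered prompt in a prompt field; message contents are illustrative):

```ts
// /apply-template renders the chat template without running inference,
// so you can inspect exactly what the model would receive.
const res = await fetch('/apply-template', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [
      { role: 'user', content: 'Write a haiku about autumn.' },
      { role: 'assistant', content: 'Crimson leaves drifting down the ' }
    ]
  })
});
const { prompt } = await res.json();
// If the rendered prompt ends with a fresh generation prompt (e.g. a new
// assistant header) instead of the partial text, continuation will break.
console.log(prompt);
```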

ngxson avatar Nov 13 '25 20:11 ngxson

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

Should we then address the thinking models differently for now, at least from the WebUI perspective?

If it's not too complicated, I'd say change the logic so that "Continue" includes the reasoning of the last assistant message for all reasoning models.
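
On the request side that could look something like this. It is a sketch only: whether the server's template path accepts reasoning_content on incoming assistant messages is an assumption here, and the variables are hypothetical.

```ts
// Hypothetical inputs: the visible text and parsed reasoning captured so far.
const partialAnswer = 'The answer is';
const reasoningTrace = 'The user asked for X, so I should...';

// Prefill the last assistant turn with both fields, so templates that
// support reasoning_content can format the trace back in.
const lastAssistantTurn = {
  role: 'assistant',
  content: partialAnswer,
  reasoning_content: reasoningTrace
};
```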

ggerganov avatar Nov 13 '25 20:11 ggerganov

If it's not too complicated, I'd say change the logic so that "Continue" includes the reasoning of the last assistant message for all reasoning models.

The main issue is that some chat templates actively suppress the reasoning content from assistant messages, so I doubt it will work across all models.

Actually, I'm thinking about a more generic approach: we can implement a feature in the backend such that the "raw" generated text (i.e. with <think>, <reasoning>, etc.) is sent along with the already-parsed version.
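
The response shape for that idea might look something like this (purely hypothetical; the fields are not an existing llama.cpp API):

```ts
// Hypothetical: with an opt-in request flag, each assistant message could
// carry the verbatim model output next to the parsed fields, making a
// lossless "continue" possible for any reasoning format.
interface AssistantMessageOut {
  content: string;            // parsed, display-ready text
  reasoning_content?: string; // parsed thinking, when present
  raw_content?: string;       // verbatim output, incl. <think>/<reasoning> tags
}
```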

I would say for now we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

ngxson avatar Nov 13 '25 20:11 ngxson

I would say for now we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

Gotcha, @ngxson, let's do that

allozaur avatar Nov 13 '25 22:11 allozaur

I would say for now we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

For reasoning models we can also disable continue altogether. I don't think it is useful for reasoning models as it is, because it loses its reasoning trace. Also, looking at the logs, gpt-oss tends to produce some gibberish tokens that are not displayed in the UI when you use continue:

[screenshots: server logs showing the gibberish tokens]

ggerganov avatar Nov 14 '25 08:11 ggerganov

I don't want to interfere with you experts; I just want to share my insight, as I also struggled with gpt-oss on this issue (see the video). Of course, the implementation of Harmony might be different in llama.cpp, but this is the only way to get the continue feature working for gpt-oss in LM Studio.

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better. Thank you for your hard work.

https://github.com/user-attachments/assets/dea7f069-413c-4bcc-97fc-8d54889fab5c

Iq1pl avatar Nov 14 '25 11:11 Iq1pl

@ggerganov @ngxson

I've added this setting, and for now the "Continue" icon button is rendered only for non-reasoning models.

[Screenshot 2025-11-14 at 12:46:09]

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better. Thank you for your hard work.

Maybe we can tackle this in the next iteration of this feature? Idk, @ngxson, do you think it's still worth doing this as part of this PR, or do we want to revisit it in the future?

allozaur avatar Nov 14 '25 12:11 allozaur

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.

I'm a bit surprised that it doesn't work in LM Studio. IIRC LM Studio doesn't actually modify the content; they parse it for display while still keeping the original generated content under the hood. CC @mattjcly from the LM Studio team (as this is probably a bug).

As I mentioned earlier in https://github.com/ggml-org/llama.cpp/pull/16971#issuecomment-3529609521, we can preserve the raw content by introducing a new flag. But this is currently a low-prio task and we can do it later if more users need it.

ngxson avatar Nov 14 '25 12:11 ngxson

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.

I'm a bit surprised that it doesn't work in LM Studio. IIRC LM Studio doesn't actually modify the content; they parse it for display while still keeping the original generated content under the hood. CC @mattjcly from the LM Studio team (as this is probably a bug).

As I mentioned earlier in #16971 (comment), we can preserve the raw content by introducing a new flag. But this is currently a low-prio task and we can do it later if more users need it.

@ngxson I guess this doesn't stop us from having that PR reviewed and eventually merged?

allozaur avatar Nov 14 '25 14:11 allozaur

@ngxson I guess this doesn't stop us from having that PR reviewed and eventually merged?

Yes, the current approach in this PR should be enough. Will give it a try a bit later.

ngxson avatar Nov 14 '25 15:11 ngxson

@ngxson I guess this doesn't stop us from having that PR reviewed and eventually merged?

Yes, the current approach in this PR should be enough. Will give it a try a bit later.

Sure, lemme know!

allozaur avatar Nov 15 '25 20:11 allozaur

Noticed a small bug: if the server absolutely cannot continue the message, it responds with a stop event right away. This causes the webui to show an error, even when the request is technically successful:

[screenshot: the error shown in the webui]

ngxson avatar Nov 17 '25 15:11 ngxson

Noticed a small bug: if the server absolutely cannot continue the message, it responds with a stop event right away. This causes the webui to show an error, even when the request is technically successful:

So instead it should inform the user that the response cannot be continued, or do something else? Let me know what you think would be the best pattern here.

allozaur avatar Nov 17 '25 18:11 allozaur

I think it should behave as if nothing was added, without any error message. IIRC LM Studio has the same behavior.
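
In the stream handler, that could be as simple as this (a sketch; ChatDelta and appendToLastAssistantMessage are hypothetical):

```ts
interface ChatDelta {
  content?: string;
}

// Hypothetical UI hook; the real webui appends into its message store.
declare function appendToLastAssistantMessage(text: string): void;

async function handleContinueStream(stream: AsyncIterable<ChatDelta>): Promise<void> {
  let received = '';
  for await (const delta of stream) {
    received += delta.content ?? '';
  }
  if (received.length > 0) {
    appendToLastAssistantMessage(received);
  }
  // Zero new tokens means the server stopped immediately: leave the
  // message unchanged and show no error.
}
```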

ngxson avatar Nov 17 '25 20:11 ngxson

the response cannot be continued

Yeah, a message like that can also be a good solution.

ngxson avatar Nov 17 '25 20:11 ngxson

I think it should behave as if nothing was added, without any error message. IIRC LM Studio has the same behavior.

@ngxson I finally decided to go with this; check e68b4165f0ae6293ee171810d8493d049bfcf0da + 9f653dc68c91d458c2a67d27572ede07ce96692b

allozaur avatar Nov 18 '25 18:11 allozaur