Even in interactive mode, multi-turn conversation is not possible.
Thanks for the wonderful work!
I am running the falcon-7b-instruct model with falcon_main. I generated the model file with the conversion script, and from the warning messages I can tell it is in the old format. It runs perfectly fine for a given prompt, but I cannot continue the chat after the model generates its output, even in interactive mode. Since every run of the falcon_main binary incurs a significant time overhead from GPU offloading, I would like to have multi-turn conversations within a single run. Is there a way to achieve that?
I'm sorry, there are indeed a couple of bugs in the chat mode. It works most reliably when using an OpenAssistant model.
With some other fine-tunes I noticed a problem with stopwords: for most fine-tunes falcon_main uses stopwords to keep them from "babbling", and those sometimes cause issues in chat mode. You can override the stopwords with -S "----". Maybe give that a try, and also try OpenAssistant; I've had long chats with that already.
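For example, something along these lines (the model path and prompt are just placeholders, and I'm assuming falcon_main keeps llama.cpp's usual -m/-p/-i flags):

# override the default stopwords with "----" while chatting interactively
./falcon_main -m models/openassistant-7b.bin -i -S "----" -p "Hello!"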
Which fine-tune did you use?
I'll try to fix that once and for all as soon as I have the new release ready, but that can take a few days as it's a big change I am sitting on.
If you work with larger prompts, try the prompt cache. It does not save you from the loading time, but it allows you to store an entire preprocessed prompt, which can save a lot of waiting time.
Update: don't use the prompt cache for now. It's broken with the new KV cache and will be fixed with the next PR. To use the cache anyway, define FALCON_NO_KV_UPGRADE.
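Usage would look roughly like this, assuming the fork keeps llama.cpp's --prompt-cache and -f flags (treat the exact flag names as an assumption):

# first run evaluates the long prompt and stores the preprocessed state
./falcon_main -m models/7B/model.bin --prompt-cache prompt.cache -f long_prompt.txt
# later runs reuse prompt.cache instead of re-evaluating the prompt
# (per the update above, this currently requires building with FALCON_NO_KV_UPGRADE defined)
./falcon_main -m models/7B/model.bin --prompt-cache prompt.cache -f long_prompt.txt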
I downloaded the Falcon 7B instruction fine-tuned model from https://huggingface.co/tiiuae/falcon-7b-instruct and saved it under ggllm.cpp/models/falcon7b_instruct with
from transformers import AutoModelForCausalLM
# Falcon ships custom modelling code, hence trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
# writes the weights and config (but not the tokenizer) to the local folder
model.save_pretrained("ggllm.cpp/models/falcon7b_instruct")
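Note that save_pretrained on the model alone does not write the tokenizer files. Something like the following should fetch those as well (an untested sketch, assuming AutoTokenizer resolves Falcon's tokenizer the usual way):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
# writes tokenizer.json (and the tokenizer config) next to the weights
tokenizer.save_pretrained("ggllm.cpp/models/falcon7b_instruct")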
In the end I just manually copy-pasted tokenizer.json into the ggllm.cpp/models/falcon7b_instruct folder. Then, I converted the model with
python falcon_convert.py models/falcon7b_instruct models/7B
I can use the .bin model with falcon_main as I explained before.
If I override the stopwords with -S, the application quits after the model generates the stopwords instead of returning control to the user.
Edit: I guess I found the source of the problem. I only provided the --interactive-first flag, which gives the first turn to me but does not allow multi-turn conversation. Adding -ins allows multi-turn conversation. Feel free to close the issue.
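For reference, the invocation that works for me now looks roughly like this (the model filename is a placeholder for whatever falcon_convert.py produced):

# -ins starts an instruction-following chat loop with multiple turns
./falcon_main -m models/7B/converted-model.bin -ins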