Even in interactive mode, multi-turn conversation is not possible.
Thanks for the wonderful work!
I am running the falcon-7b-instruct model with falcon_main. I generated the model file with the conversion script, and from the warning messages I can tell it is in the old format. It runs perfectly fine for a given prompt, but I cannot continue the chat after the model generates its output, even in interactive mode. Since every run of the falcon_main binary incurs a significant time overhead from GPU offloading, I would like to have multi-turn conversations within a single run. Is there a way to achieve that?
I'm sorry, there are indeed a couple of bugs in the chat mode. It works most reliably when using an OpenAssistant model.
With some other fine-tunes I noticed a problem with stopwords: for most fine-tunes falcon_main uses stopwords to keep them from "babbling", and those sometimes cause issues in chat mode. You can override the stopwords with -S "----". Maybe give that a try, and also try OpenAssistant; I've had long chats with that already.
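For example, something along these lines (the model path and prompt are just placeholders, and I'm assuming falcon_main keeps llama.cpp's usual -m/-p/-i flags):

# override the default stopwords with "----" while chatting interactively
./falcon_main -m models/openassistant-7b.bin -i -S "----" -p "Hello!"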
Which fine-tune did you use?
I'll try to fix that once and for all as soon as I have the new release ready, but that can take a few days as it's a big change I am sitting on.
If you work with larger prompts, try the prompt cache. It does not save you from the loading time, but it allows you to store an entire preprocessed prompt, which can save a lot of waiting time.
Update: don't use the prompt cache for now. It's broken with the new KV cache and will be fixed with the next PR. To use the cache anyway, define FALCON_NO_KV_UPGRADE.
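Usage would look roughly like this, assuming the fork keeps llama.cpp's --prompt-cache and -f flags (treat the exact flag names as an assumption):

# first run evaluates the long prompt and stores the preprocessed state
./falcon_main -m models/7B/model.bin --prompt-cache prompt.cache -f long_prompt.txt
# later runs reuse prompt.cache instead of re-evaluating the prompt
# (per the update above, this currently requires building with FALCON_NO_KV_UPGRADE defined)
./falcon_main -m models/7B/model.bin --prompt-cache prompt.cache -f long_prompt.txt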
I downloaded the Falcon 7B instruction fine-tuned model from https://huggingface.co/tiiuae/falcon-7b-instruct and saved it under ggllm.cpp/models/falcon7b_instruct with
from transformers import AutoModelForCausalLM
# Falcon ships custom modelling code, hence trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
# writes the weights and config (but not the tokenizer) to the local folder
model.save_pretrained("ggllm.cpp/models/falcon7b_instruct")
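Note that save_pretrained on the model alone does not write the tokenizer files. Something like the following should fetch those as well (an untested sketch, assuming AutoTokenizer resolves Falcon's tokenizer the usual way):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
# writes tokenizer.json (and the tokenizer config) next to the weights
tokenizer.save_pretrained("ggllm.cpp/models/falcon7b_instruct")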
In the end I just manually copy-pasted tokenizer.json into the ggllm.cpp/models/falcon7b_instruct folder. Then, I converted the model with
python falcon_convert.py models/falcon7b_instruct models/7B
I can use the .bin model with falcon_main as I explained before.
If I override the stopwords with -S, the application quits after the model generates the stopwords instead of returning control to the user.
Edit: I guess I found the source of the problem. I only provided the --interactive-first flag, which gives the first turn to me but does not allow multi-turn conversation. Adding -ins allows multi-turn conversation. Feel free to close the issue.
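For reference, the invocation that works for me now looks roughly like this (the model filename is a placeholder for whatever falcon_convert.py produced):

# -ins starts an instruction-following chat loop with multiple turns
./falcon_main -m models/7B/converted-model.bin -ins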