
Added streaming langchain example.

Open CoffeeVampir3 opened this issue 2 years ago • 2 comments

I think adding this as an example makes the most sense: it's a relatively complete conversational model setup using Exllama and langchain. I've probably made some dumb mistakes, as I'm not deeply familiar with the inner workings of Exllama, but this is a working example.

I should note that this is meant to serve as an example of streaming; the non-streaming path simply falls back to generate_simple and isn't the intended use here.
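For readers skimming the PR, here is a minimal sketch of the general shape of such a wrapper, not the PR's exact code: a custom LangChain LLM that streams tokens from exllama one at a time and falls back to generate_simple otherwise. The class name and field layout are assumptions; the generator calls (gen_begin, gen_single_token, generate_simple) follow exllama's public generator API.

```python
from typing import Any, List, Optional

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM


class ExllamaLLM(LLM):
    """Hypothetical LangChain wrapper around an ExLlamaGenerator."""

    generator: Any          # ExLlamaGenerator
    tokenizer: Any          # ExLlamaTokenizer
    max_new_tokens: int = 512
    streaming: bool = True

    @property
    def _llm_type(self) -> str:
        return "exllama"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        if not self.streaming:
            # Non-streaming path: fall back to generate_simple, as noted above.
            response = self.generator.generate_simple(
                prompt, max_new_tokens=self.max_new_tokens
            )
            return response[len(prompt):]

        # Streaming path: prime the cache with the prompt, then sample one
        # token at a time, re-decoding the sequence and emitting the delta.
        self.generator.gen_begin(self.tokenizer.encode(prompt))
        # Decode the primed sequence once so later slicing survives
        # tokenizer round-trip differences.
        base = self.tokenizer.decode(self.generator.sequence[0])
        text = ""
        for _ in range(self.max_new_tokens):
            token = self.generator.gen_single_token()
            decoded = self.tokenizer.decode(self.generator.sequence[0])[len(base):]
            chunk, text = decoded[len(text):], decoded
            if run_manager and chunk:
                run_manager.on_llm_new_token(chunk)
            if token.item() == self.tokenizer.eos_token_id:
                break
            if stop and any(s in text for s in stop):
                break
        return text
```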

CoffeeVampir3 avatar Jun 18 '23 14:06 CoffeeVampir3

So, I can't actually get this to produce any output. If I run it as-is, with a prompt of "Hello?" and a breakpoint in the stream() function, the context passed to the model looks like this:

```
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

Human: 
    ### Instruction: 
    You are an extremely serious chatbot. Do exactly what is asked of you and absolutely nothing more.
    ### User:
    Hello?
    ### Response:


AI:
```

It looks like there are two nested prompt formats there. Following the Alpaca template, I would expect generation to start after "### Response:", but at least with the models I've tried, the model starts by generating " \n ###", which immediately triggers a stop condition.
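For illustration, this nesting typically happens because ConversationChain ships with its own default prompt (the "friendly conversation" preamble above), so passing an already Alpaca-formatted string as the input wraps one template inside the other. A hedged sketch of the single-template setup that avoids it, with illustrative template text and memory prefixes rather than the PR's actual code:

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# One template only: the chain's prompt carries the instruction block itself,
# so the user input no longer needs to be pre-formatted.
template = """### Instruction:
You are an extremely serious chatbot. Do exactly what is asked of you and absolutely nothing more.
{history}
### User:
{input}
### Response:"""

chain = ConversationChain(
    llm=llm,  # an ExLlama-backed LLM wrapper, e.g. the sketch above
    prompt=PromptTemplate(input_variables=["history", "input"], template=template),
    memory=ConversationBufferMemory(human_prefix="### User", ai_prefix="### Response"),
)
print(chain.predict(input="Hello?"))  # only one prompt format reaches the model
```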

turboderp avatar Jun 20 '23 20:06 turboderp

I don't know exactly why the model wouldn't generate anything; potentially the models were being temperamental about the nested formats.

I made the following changes:

- Added a number of debugging outputs and some basic benchmarking.
- Switched the prompt template to an airoboros-vicuna format, following the advice from https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.2, and wired it correctly into the history as well; there should be only a single prompt format now.
- Corrected some bugs around the generation length running past the attention cache's maximum size (see the sketch after this list).
- Fixed stop strings being matched case-sensitively (also sketched below).
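The last two fixes amount to something like the following. This is an illustrative sketch, not the PR's exact code; config.max_seq_len and tokenizer.encode follow exllama's public API, while the helper names are hypothetical.

```python
def clamp_max_new_tokens(tokenizer, config, prompt: str, max_new_tokens: int) -> int:
    # Clamp so prompt tokens + new tokens never run past the attention
    # cache's maximum sequence length (config.max_seq_len in exllama).
    prompt_token_count = tokenizer.encode(prompt).shape[-1]
    return max(0, min(max_new_tokens, config.max_seq_len - prompt_token_count))


def hit_stop(text: str, stops: list[str]) -> bool:
    # Match stop strings case-insensitively against the generated text.
    lowered = text.lower()
    return any(s.lower() in lowered for s in stops)
```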

I'm unsure why nothing was generated, but the models may have been confused by the mixed formats. If the issue persists, that's more troubling, as I have no idea what else would cause empty output. I've tested on about 10 models and they all perform quite well. At any rate, let me know if the issues continue and I'll investigate further.

CoffeeVampir3 avatar Jun 20 '23 23:06 CoffeeVampir3