Mechanism for injecting "fake" responses from models into the conversation
https://twitter.com/simonw/status/1694089359514104094
A useful trick is sometimes to feed a model a prior conversation that includes things that the model didn't actually say - things like "Sure, I'd be happy to help you with that" for minor jailbreaks.
Not sure what the CLI options for this should look like, or how they should be recorded in the SQLite database logs.
-
--answer-prefix 'Sure, I'd be happy to help you with that'This is good for jailbreaks. I don't believe it's necessary to save this prefix in a database; it would be simpler to include it at the beginning of the answer, giving the impression that the model started its answer with this prefix. -
Editing a conversation history is also needed sometimes. My best bet here is to output markdown or org-mode, open an editor that edits this markdown and then pipes it back to
llmto be parsed and converted to a proper JSON chat history.
Check out my project https://github.com/relston/mark, which is designed in a way that is perfect for this use case. I'm now leveraging LLM as a dependency.