Transformers separate model server?
This may be a stupid question; please forgive me if so.
The OpenAI interface obviously relies on an independently existing server for gpt-3.5 and gpt-4.
The Transformers interface, though, assumes guidance will load the model internally. Loading models in Transformers takes forever, even when they are already cached.
Is there a way to point to an existing 'guidance' server to handle guidance prompts, so I don't have to wait through an entire model startup cycle for every prompt test when using Transformers models like Wizard-13B?
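For reference, here is a rough sketch of the workaround I'm using in the meantime: load the model once in a long-lived session and reuse that llm object across programs. The model id and prompts are just placeholders, and I'm assuming guidance.llms.Transformers can simply be constructed once and shared like this:

```python
import guidance

# Load once per session -- this is the slow part, so keep the process alive
# between prompt tests instead of reloading the model for every run.
llm = guidance.llms.Transformers("WizardLM/WizardLM-13B-V1.0")

def test_prompt(question):
    # Each program reuses the already-loaded model, so only generation time is paid.
    program = guidance("""{{question}} {{gen 'answer' max_tokens=64}}""", llm=llm)
    return program(question=question)["answer"]

print(test_prompt("What is the capital of France?"))
print(test_prompt("List three prime numbers."))
```

That avoids repeated loads within one session, but it still isn't a separate server I can point multiple test runs at.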
In the works.
If I understand the OP, this is something I am looking for as well. I want to host an ONNX model with Triton and have that interface with Guidance. @marcotcr, will what you have in the works support this?
I think this will be an issue for many, as the specifics of running an LLM are changing so fast that Guidance will have a hard time keeping up (see exllama for an example). If Guidance is in fact just using a REST API to talk to OpenAI, then depending on which API features it uses, it should be possible to swap OpenAI's server for a local server exposing an OpenAI-compatible API, such as text-generation-webui.
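For example, something like the following might already work, assuming Guidance goes through the standard openai Python client and therefore respects openai.api_base. The URL, dummy key, and model name are placeholders for a local text-generation-webui instance with its OpenAI-compatible extension enabled:

```python
import openai
import guidance

# Point the OpenAI client at a local OpenAI-compatible server instead of api.openai.com.
openai.api_base = "http://localhost:5000/v1"
openai.api_key = "sk-dummy"  # most local servers ignore the key, but the client requires one

# The model name is whatever the local server exposes; Guidance may still expect
# a name it recognizes for tokenization, which is part of what needs clarifying.
llm = guidance.llms.OpenAI("text-davinci-003")
program = guidance("""Q: {{question}}
A: {{gen 'answer' max_tokens=64}}""", llm=llm)
print(program(question="What APIs does Guidance actually call?")["answer"])
```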
To that end, it would be really interesting/useful to see a list of all the API features that Guidance uses, so developers of open-source OpenAI-compatible APIs could prioritize those features, since OpenAI API support in projects like text-generation-webui is certainly not complete.