Transformers separate model server?
This may be a stupid question; please forgive me if so.
The OpenAI interface obviously relies on an independently existing server for gpt-3.5 and gpt-4.
The Transformers interface, though, assumes guidance will load the model internally. Loading models in Transformers takes forever, even when they are already cached.
Is there a way to point to an existing 'guidance' server to handle guidance prompts, so I don't have to wait through an entire model startup cycle for every prompt test when using Transformers models like Wizard-13B?
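For reference, here is a rough sketch of the workaround I'm using in the meantime: load the model once in a long-lived session and reuse that llm object across programs. The model id and prompts are just placeholders, and I'm assuming guidance.llms.Transformers can simply be constructed once and shared like this:

```python
import guidance

# Load once per session -- this is the slow part, so keep the process alive
# between prompt tests instead of reloading the model for every run.
llm = guidance.llms.Transformers("WizardLM/WizardLM-13B-V1.0")

def test_prompt(question):
    # Each program reuses the already-loaded model, so only generation time is paid.
    program = guidance("""{{question}} {{gen 'answer' max_tokens=64}}""", llm=llm)
    return program(question=question)["answer"]

print(test_prompt("What is the capital of France?"))
print(test_prompt("List three prime numbers."))
```

That avoids repeated loads within one session, but it still isn't a separate server I can point multiple test runs at.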
In the works.
If I understand the OP, this is something I am looking for as well. I want to host an ONNX model with Triton and have that interface with Guidance. @marcotcr, will what you have in the works support this?
I think this will be an issue for many, as the specifics of running an LLM are changing so fast that Guidance will have a hard time keeping up (see exllama for an example). If Guidance is in fact just using a REST API to talk to OpenAI, then depending on which API features it uses, it should be possible to swap OpenAI's server for a local server exposing an OpenAI-compatible API, such as text-generation-webui.
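For example, something like the following might already work, assuming Guidance goes through the standard openai Python client and therefore respects openai.api_base. The URL, dummy key, and model name are placeholders for a local text-generation-webui instance with its OpenAI-compatible extension enabled:

```python
import openai
import guidance

# Point the OpenAI client at a local OpenAI-compatible server instead of api.openai.com.
openai.api_base = "http://localhost:5000/v1"
openai.api_key = "sk-dummy"  # most local servers ignore the key, but the client requires one

# The model name is whatever the local server exposes; Guidance may still expect
# a name it recognizes for tokenization, which is part of what needs clarifying.
llm = guidance.llms.OpenAI("text-davinci-003")
program = guidance("""Q: {{question}}
A: {{gen 'answer' max_tokens=64}}""", llm=llm)
print(program(question="What APIs does Guidance actually call?")["answer"])
```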
To that end, it would be really interesting/useful to see a list of all the API features that Guidance uses, so developers of open-source OpenAI-compatible APIs could prioritize those features, since OpenAI API support in projects like text-generation-webui is certainly not complete.