llama-api
An OpenAI-like LLaMA inference API
Hi, this was working quite well on CPU for me, but after I gave the tool access to the paths for libcublas, it compiled and now can't start or load...
When I run a model on my GPU, my CPU and RAM usage are insanely high
Hi! I have a strange suggestion :) Add a proxy object that forwards requests to OpenAI when openai_replacement_models specifies openai_proxy (or something like it). For example: openai_replacement_models =...
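A minimal sketch of the routing the suggestion describes, assuming a mapping named `openai_replacement_models` like the one quoted above; that name, the `openai_proxy` sentinel, and `route_request` are illustrative assumptions, not the project's actual config keys or API:

```python
# Hypothetical mapping in the spirit of the suggestion: model names that
# should be forwarded to OpenAI are marked with an "openai_proxy" sentinel,
# while others are served by the local LLaMA backend under an alias.
openai_replacement_models = {
    "gpt-3.5-turbo": "openai_proxy",   # forward these requests to OpenAI
    "gpt-4": "local-llama-13b",        # serve locally under an alias
}

def route_request(model: str) -> str:
    """Return 'openai' when the requested model is mapped to the proxy,
    otherwise 'local' (including for unmapped model names)."""
    if openai_replacement_models.get(model) == "openai_proxy":
        return "openai"
    return "local"
```

The actual implementation would also need to copy the request body through to the OpenAI endpoint and stream the response back, but the lookup above is the core dispatch decision.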
Hello, I appreciate this API, but I am struggling to use the embedding part with langchain, is there any support regarding how to (if possible) use the embedding with langchain?...
Hello can someone guide me to run this nice API in CPU mode only
Please add support for exllamav2
Support the [min_p sampler](https://old.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/), which is implemented in ExLlamaV2.
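For reference, min_p filtering keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A pure-Python sketch of that rule (real implementations operate on logit tensors inside the sampler, not on Python lists):

```python
import math

def min_p_filter(logits, min_p=0.05):
    """Min-p filtering sketch: compute softmax probabilities, then mask
    (set to -inf) every token whose probability falls below
    min_p * max(probabilities). Surviving logits are returned unchanged."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    return [logit if p >= threshold else float("-inf")
            for logit, p in zip(logits, probs)]
```

With `min_p=0.05`, a token needs at least 5% of the top token's probability to survive, which adapts the cutoff to how peaked the distribution is.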
For example, openchat 3.5 wants this prompt template format: `GPT4 User: {prompt}GPT4 Assistant:` I tried a few things and managed to crash the server, so I am stuck. Can anyone...
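One way to apply a template like the one quoted in that issue is to render the chat messages into a single string before sending them to the completion endpoint. The sketch below uses only the format as quoted; the model card should be checked for the exact template (including any end-of-turn tokens) before relying on it, and `build_openchat_prompt` is an illustrative helper, not part of this API:

```python
def build_openchat_prompt(messages):
    """Render OpenAI-style chat messages into the template quoted above:
    'GPT4 User: {prompt}GPT4 Assistant:'. Turns are concatenated with no
    separator, exactly as the issue shows, and the prompt ends with an
    open 'GPT4 Assistant:' turn for the model to complete."""
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"GPT4 User: {msg['content']}")
        elif msg["role"] == "assistant":
            parts.append(f"GPT4 Assistant: {msg['content']}")
    parts.append("GPT4 Assistant:")  # open turn for the model's reply
    return "".join(parts)
```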
Could there be some new format of gguf that we need to update the code for or something?
It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.
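Upstream exllama expresses a multi-GPU split as a comma-separated list of per-device VRAM budgets in GiB (e.g. "16,24" for two GPUs, in device order); whether and how llama-api exposes that option is not documented, so the helper below is an assumption that only illustrates the string format:

```python
def parse_gpu_split(spec: str):
    """Parse an exllama-style GPU split string such as '16,24' (GiB of
    VRAM to allocate on each GPU, in device order) into a list of floats.
    This mirrors exllama's gpu-split convention; the option name and how
    it is passed through llama-api are assumptions, not confirmed API."""
    return [float(part) for part in spec.split(",") if part.strip()]
```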