
An OpenAI-like LLaMA inference API

16 llama-api issues

Hi, this was working quite well on CPU for me, but after I gave the tool access to the paths for libcublas, it compiled and now can't start or load...

![image](https://github.com/c0sogi/llama-api/assets/39416418/929e3839-fdd8-41b1-bbe4-1fa88bb5f8d3) When I run a model on my GPU, my CPU and RAM usage is insanely high

Hi! I have a strange suggestion :) Add a proxy object that forwards requests to OpenAI when openai_replacement_models specifies openai_proxy (or something like it). For example: openai_replacement_models =...

Hello, I appreciate this API, but I am struggling to use the embedding part with LangChain. Is there any guidance on how (if possible) to use the embeddings with LangChain?...
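Since the server exposes an OpenAI-compatible `/v1/embeddings` route, one possible workaround is to call it directly (or to point LangChain's `OpenAIEmbeddings` at the server via its `openai_api_base` setting). Below is a stdlib-only sketch of the request shape; the base URL, port, and model name are placeholders and assumptions, not values taken from this repo:

```python
import json
import urllib.request

def build_embeddings_request(texts, model="my-embedding-model"):
    # Payload shape of the OpenAI-style /v1/embeddings endpoint;
    # "my-embedding-model" is a placeholder for whatever model you
    # defined in your model_definitions.py.
    return {"model": model, "input": texts}

def embed(texts, base_url="http://localhost:8000/v1"):
    # base_url is an assumption (your server's host/port may differ);
    # sends the request and returns the parsed JSON response.
    payload = json.dumps(build_embeddings_request(texts)).encode()
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any client that speaks this request shape should work the same way against the server.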

Hello, can someone guide me on running this nice API in CPU-only mode?

Please add support for exllamav2

Support the [min_p sampler](https://old.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/), which is implemented in ExLlamaV2.
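For context, min_p sampling keeps only the tokens whose probability is at least `min_p` times the top token's probability, then renormalizes. A minimal pure-Python sketch of the idea (function and parameter names are illustrative, not from this repo or ExLlamaV2):

```python
import math

def min_p_filter(logits, min_p=0.05):
    # Softmax over the logits (shifted by the max for numerical stability).
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # min_p keeps tokens whose probability is at least min_p times
    # the top token's probability; everything else is zeroed out.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    z = sum(kept)
    return [p / z for p in kept]
```

In a real sampler this filter would run per decoding step, before the final multinomial draw.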

For example, openchat 3.5 wants this prompt template format: GPT4 User: {prompt}GPT4 Assistant: I tried a few things and managed to crash the server, so I am stuck. Can anyone...
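For reference, a template like the one quoted above can be applied with a small helper before the text reaches the model. This is only a sketch using the template exactly as it appears in the issue; the snippet looks truncated (OpenChat's turn separator may be missing), so verify against the model card before relying on it:

```python
def build_openchat_prompt(messages):
    # Template as quoted in the issue ("GPT4 User: {prompt}GPT4 Assistant:");
    # the separator between turns may have been lost in truncation.
    parts = []
    for msg in messages:
        role = "GPT4 User" if msg["role"] == "user" else "GPT4 Assistant"
        parts.append(f"{role}: {msg['content']}")
    # Trailing assistant prefix so the model continues as the assistant.
    parts.append("GPT4 Assistant:")
    return "".join(parts)
```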

Could there be some new GGUF format that we need to update the code for, or something?

It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.