chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
```shell
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3080'
[ChatLLM.cpp ASCII-art banner (百川)] ...
```
With 686 tokens, a single run takes more than 6 seconds on a 96-core machine. Here is the profiling data for the compute graph: [bge-reranker-dump.txt](https://github.com/user-attachments/files/15910956/bge-reranker-dump.txt) Any advice for better performance?
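As a quick sanity check alongside the graph dump, wall-clock timing with `std::chrono` around the call being profiled can confirm where the 6 seconds go. A minimal sketch; `rerank_score()` is a hypothetical stand-in for the reranker forward pass, not a chatllm.cpp API:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the call under test (e.g. one reranker
// forward pass over a query/passage pair).
static float rerank_score() { return 0.0f; }

int main() {
    const int warmup = 2, runs = 10;
    // Warm-up runs exclude one-time allocation / weight-loading cost.
    for (int i = 0; i < warmup; i++) rerank_score();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; i++) rerank_score();
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
    std::printf("avg per run: %.2f ms\n", ms);
    return 0;
}
```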
GGML is effectively no longer supported, and all models moved to GGUF as the standard format about a year ago. Are there any plans to support GGUF here? I'm wondering...
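One concrete difference between the two formats: a GGUF file begins with the four ASCII bytes `GGUF`, so distinguishing it from a legacy GGML file takes only a header check. A minimal sketch (no chatllm.cpp APIs involved; the usage line is illustrative):

```cpp
#include <cstdio>
#include <cstring>
#include <fstream>

// Returns true if the file starts with the GGUF magic bytes "GGUF".
static bool is_gguf(const char *path) {
    std::ifstream f(path, std::ios::binary);
    char magic[4] = {};
    f.read(magic, 4);
    return f.gcount() == 4 && std::memcmp(magic, "GGUF", 4) == 0;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }
    std::printf("%s: %s\n", argv[1],
                is_gguf(argv[1]) ? "GGUF" : "not GGUF (legacy GGML?)");
    return 0;
}
```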
I followed the new code on GitHub and it compiles successfully, thanks. Then, following the 'Tutorial on RAG', I used the code from GitHub to generate 'fruits.dat', and in the next...
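For reference, the retrieval step that a store like 'fruits.dat' feeds into boils down to scoring a query embedding against every stored vector and keeping the best match. The sketch below uses toy in-memory vectors and cosine similarity; it does not reflect the actual on-disk format of 'fruits.dat':

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Cosine similarity between two equal-length embedding vectors.
static float cosine(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (size_t i = 0; i < a.size(); i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
}

int main() {
    // Toy 3-d embeddings standing in for vectors loaded from a store.
    std::vector<std::vector<float>> store = {
        {0.9f, 0.1f, 0.0f},  // doc 0
        {0.1f, 0.8f, 0.2f},  // doc 1
    };
    std::vector<float> query = {0.8f, 0.2f, 0.1f};

    int best = -1;
    float best_sim = -1.f;
    for (size_t i = 0; i < store.size(); i++) {
        float s = cosine(query, store[i]);
        if (s > best_sim) { best_sim = s; best = (int)i; }
    }
    std::printf("best match: doc %d (sim = %.3f)\n", best, best_sim);
    return 0;
}
```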
The `model_downloader.py` script doesn't list the recently supported Phi-3.5 MoE. I'd also like to know whether it's OK to use the v0.3 release from Jul 6 as-is to run...