Eric Buehler
We should distinguish between two cases in `api_get_file!`:

- 404: read from local
- Anything else: propagate the error

Currently, if the "error" is not 404, we will still attempt reading...
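As a rough sketch of the intended control flow (the error enum, `fetch_remote`, and `read_local` below are hypothetical stand-ins, not the actual `api_get_file!` internals):

```rust
use std::path::{Path, PathBuf};

// Hypothetical error type standing in for the hub API error.
#[derive(Debug)]
enum FetchError {
    NotFound,      // HTTP 404 from the hub
    Other(String), // any other failure (network, auth, ...)
}

// Placeholder for the actual hub request.
fn fetch_remote(_file: &str) -> Result<PathBuf, FetchError> {
    Err(FetchError::NotFound)
}

// Placeholder local fallback.
fn read_local(file: &str) -> Result<PathBuf, FetchError> {
    let p = Path::new(file).to_path_buf();
    if p.exists() {
        Ok(p)
    } else {
        Err(FetchError::Other(format!("no local copy of {file}")))
    }
}

fn get_file(file: &str) -> Result<PathBuf, FetchError> {
    match fetch_remote(file) {
        Ok(p) => Ok(p),
        // 404: the file simply is not on the hub, so fall back to a local read.
        Err(FetchError::NotFound) => read_local(file),
        // Anything else (auth failure, network error, ...): propagate instead of
        // silently attempting the local read.
        Err(e) => Err(e),
    }
}

fn main() {
    match get_file("config.json") {
        Ok(p) => println!("using {}", p.display()),
        Err(e) => eprintln!("error: {e:?}"),
    }
}
```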
This is currently pending a way to do `topk` in Candle.
This will allow loading very large models onto the CPU and then applying ISQ onto the device.
Model Wishlist
Please let us know which model architectures you would like to see added! **Up-to-date todo list below.** Please feel free to contribute any model; a PR without device...
- [ ] RowParallelLinear
- [ ] MergedColumnParallelLinear
- [ ] QKVParallelLinear
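For context, a dependency-free toy sketch of the row-parallel case: the weight's input dimension is split across shards, each shard computes a partial product on its slice of the input, and the partials are summed (an all-reduce in a real tensor-parallel setup). Names and shapes here are illustrative only:

```rust
// Toy row-parallel linear: weight rows are [out][in_shard]; each shard multiplies
// its slice of the input, and the partial outputs are summed across shards.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    // Full weight: 2 outputs x 4 inputs, split into two shards of 2 inputs each.
    let w_shard0 = vec![vec![1.0, 2.0], vec![0.5, 0.0]];
    let w_shard1 = vec![vec![3.0, 4.0], vec![0.0, 0.5]];
    let x = [1.0f32, 1.0, 1.0, 1.0];
    let (x0, x1) = x.split_at(2);

    let partial0 = matvec(&w_shard0, x0);
    let partial1 = matvec(&w_shard1, x1);

    // "All-reduce": sum the partial results to recover the full output.
    let y: Vec<f32> = partial0.iter().zip(&partial1).map(|(a, b)| a + b).collect();
    println!("{y:?}"); // [10.0, 1.0]
}
```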
Refs and closes #215.

# API addition

- DeviceMapper
  - All at-loading-time methods have a `loading_isq` parameter
  - Add `fn set_nm_device(..., loading_isq: bool) -> VarBuilder`
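A minimal, self-contained sketch of what such a mapper method could look like; the local `VarBuilder` is only a placeholder stand-in, and the real trait in mistral.rs may differ in signature and semantics:

```rust
// Placeholder stand-in for the real candle_nn VarBuilder.
#[derive(Clone)]
struct VarBuilder {
    device: String,
}

// Hypothetical sketch of the device-mapping trait: at-loading-time methods take a
// `loading_isq` flag so weights destined for ISQ can stay on the CPU and be moved
// to the target device only after quantization.
trait DeviceMapper {
    fn set_nm_device(&self, vb: VarBuilder, loading_isq: bool) -> VarBuilder;
}

struct SimpleMapper {
    target: String,
}

impl DeviceMapper for SimpleMapper {
    fn set_nm_device(&self, mut vb: VarBuilder, loading_isq: bool) -> VarBuilder {
        // When loading for ISQ, keep the weights on the CPU; otherwise place them
        // directly on the mapped device.
        vb.device = if loading_isq { "cpu".to_string() } else { self.target.clone() };
        vb
    }
}

fn main() {
    let mapper = SimpleMapper { target: "cuda:0".to_string() };
    let vb = VarBuilder { device: "cpu".to_string() };
    let vb = mapper.set_nm_device(vb, true);
    println!("weights will load on: {}", vb.device);
}
```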
# Description

In this PR, I have added a new integration for the [`mistral.rs`](https://github.com/EricLBuehler/mistral.rs) LLM inference platform. `mistral.rs` is a new LLM inference platform with key features such...
Argsort was just added to Candle (https://github.com/huggingface/candle/pull/2132). Using an argsort kernel will accelerate the CPU sorting currently used for `topk` and `topp` sampling, which takes a lot of time.
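For illustration, a dependency-free sketch of argsort-based top-k filtering of logits; the actual change would use Candle's new argsort kernel on-device rather than sorting on the CPU like this:

```rust
// Argsort-based top-k: order indices by descending logit, keep the first k.
fn topk_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_unstable_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    idx
}

fn main() {
    let logits = [0.1f32, 2.5, -1.0, 0.7, 3.2];
    // Keep only the 3 most likely tokens before sampling.
    println!("{:?}", topk_indices(&logits, 3)); // [4, 1, 3]
}
```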
Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to make the sampling process more abstracted. It will also abstract the scheduling and KV cache management.

# Restriction

- ...
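For reference, a toy sketch of the draft-then-verify loop from the linked paper, with closures standing in for the draft and target models; the real scheme samples from both distributions and uses an acceptance ratio rather than exact greedy agreement:

```rust
// One speculative decoding step: draft `gamma` tokens with the cheap model, then
// verify them with the target model, keeping the longest agreeing prefix.
fn speculative_step(
    draft: &dyn Fn(&[u32]) -> u32,  // cheap draft model: next token for a prefix
    target: &dyn Fn(&[u32]) -> u32, // expensive target model: next token for a prefix
    prefix: &mut Vec<u32>,
    gamma: usize,                   // number of tokens drafted per step
) {
    // 1) Draft `gamma` tokens autoregressively with the cheap model.
    let mut ctx = prefix.clone();
    let mut drafted = Vec::with_capacity(gamma);
    for _ in 0..gamma {
        let t = draft(ctx.as_slice());
        ctx.push(t);
        drafted.push(t);
    }
    // 2) Verify: accept drafted tokens while the target agrees, then emit the
    //    target's own token at the first disagreement.
    for &t in &drafted {
        let expected = target(prefix.as_slice());
        if expected == t {
            prefix.push(t); // accepted draft token
        } else {
            prefix.push(expected); // target's correction; stop this step
            return;
        }
    }
    // All drafts accepted: the target still yields one extra token for free.
    let bonus = target(prefix.as_slice());
    prefix.push(bonus);
}

fn main() {
    // Toy models: the draft repeats the last token; the target increments it.
    let draft = |ctx: &[u32]| *ctx.last().unwrap_or(&0);
    let target = |ctx: &[u32]| *ctx.last().unwrap_or(&0) + 1;
    let mut prefix = vec![1u32];
    speculative_step(&draft, &target, &mut prefix, 4);
    println!("{prefix:?}");
}
```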
Also enable logging for pyo3 bindings.