Leon Knauer
Hi there! I built TensorFlow 2.1 for macOS Catalina with support for AVX, AVX2, FMA, SSE4.1, and SSE4.2. You can find the wheel file here: https://github.com/reuank/tensorflow-wheels-macOS/releases/tag/tensorflow-2.1-catalina
Hey @Rybens92, I haven't tested this exact configuration myself yet, but you can try specifying a tokenizer directly via the `tokenizer="ehartford/dolphin-2.2-yi-34b"` option. Playground example: ``` argmax "What is...
You need to add the `trust_remote_code=True` option, as the YiTokenizer is not known to Hugging Face's `tokenizers` library. This is also documented here: https://huggingface.co/ehartford/dolphin-2_2-yi-34b. With this, the downloaded tokenizer...
On my machine, the following example runs in the LMQL playground and produces sensible output: ``` argmax "What is the capital of France? [RESPONSE]" from lmql.model("local:llama.cpp:/YOUR_PATH/dolphin-2_2-yi-34b.Q4_0.gguf", tokenizer="ehartford/dolphin-2_2-yi-34b", trust_remote_code=True) where len(TOKENS(RESPONSE))...
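For reference, a complete version of that query might look like the following. The original comment is cut off, so the closing constraint (`< 100`) is an assumption on my part; the model path and tokenizer are taken from the comment above:

```
# minimal sketch; the length bound is assumed, not from the original comment
argmax
    "What is the capital of France? [RESPONSE]"
from
    lmql.model(
        "local:llama.cpp:/YOUR_PATH/dolphin-2_2-yi-34b.Q4_0.gguf",
        tokenizer="ehartford/dolphin-2_2-yi-34b",
        trust_remote_code=True
    )
where
    len(TOKENS(RESPONSE)) < 100
```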
Okay, I cannot reproduce that, and I know too little about the rest of your setup and the other changes you have made. Glad that you found something that works for...
Just a quick update on this topic. It looks like llama-cpp-python will add that feature very soon: https://github.com/abetlen/llama-cpp-python/pull/951.
Hey @lbeurerkellner, are you aware of anyone currently working on this? Otherwise, I will have a look at the approach @ggbetz described (adding a new vLLM backend, similar to [llama_cpp_model](https://github.com/eth-sri/lmql/blob/main/src/lmql/models/lmtp/backends/llama_cpp_model.py)).
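For context, such a backend would essentially wrap vLLM's offline generation API. A minimal sketch of the vLLM side only (not the LMTP glue); the model name and sampling settings here are placeholders:

```python
# Sketch of the vLLM calls a new LMTP backend would wrap.
# Model name and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="ehartford/dolphin-2_2-yi-34b", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64, logprobs=5)

outputs = llm.generate(["What is the capital of France?"], params)
for out in outputs:
    print(out.outputs[0].text)      # generated continuation
    print(out.outputs[0].logprobs)  # per-token logprobs, which LMQL needs for constraints
```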
Hi @KamilLegault, you can have a look here: https://lmql.ai/docs/models/llama.cpp.html#model-server. You can start an LMTP inference endpoint by running ```bash lmql serve-model llama.cpp:/YOUR_PATH/YOUR_MODEL.gguf ``` In the playground, you then need to...
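Once the endpoint is running, a playground query can reference the served model by the same identifier (without the `local:` prefix), as described in the linked docs. A minimal sketch; the path is a placeholder and the length constraint is an assumption:

```
# assumes `lmql serve-model llama.cpp:/YOUR_PATH/YOUR_MODEL.gguf` is running
argmax
    "What is the capital of France? [RESPONSE]"
from
    "llama.cpp:/YOUR_PATH/YOUR_MODEL.gguf"
where
    len(TOKENS(RESPONSE)) < 100
```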
Hey @parallaxe, I am also very interested in this feature. Have you managed to get the attention scores yet?