cduk
cduk
### Describe the bug Starting with the prompt "The Eiffel Tower is " continues "324 meters high and weighs 7,300 tons. It was built in 1889 for the Universal Exhibition...
I'm trying to get AIChat to work with a shortcut so that after typing in a chat message and pressing the hotkey in insert mode it would run AIChat. So...
Instead of running an instance per model in the dockerfile. Can a list of models be provided at instantiation and then the model is chosen via the api request. The...
Llama 3
Given that we have only Llama 3 70B and 8B, it would be useful to have a Tiny Llama based on the Llama 3 tokenizer so that we can use...
### Your current environment Collecting environment information... PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS...
There are many other parameters that are not passed through. Maybe arbitrary options can be passed through. Most important are stop tokens, early stopping, repetition etc. SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0,...
With Qwen3 you can dynamically control whether thinking is used or not. In some cases from the CLI, I want thinking to happen but take only the final output to...
### Feature request Is there a way of receiving the embeddings back in BQ format? Right now, I receive the full precision embedding and quantize it in the client, but...
The enterprise version has some additional features for auto-repair. Can you include more details on this in the wiki including whether this involves any changes to the on-disk format (if...
### Feature request Currently compile options allow specifying compute cap down to 75. Can Pascal generation cc 6.0/6.1 also be supported? ``` Dockerfile-cuda:50 -------------------- 49 | 50 | >>> RUN...