cduk issues

Results: 11 issues of cduk

Corrupt output when Beam>1 is used

### Describe the bug Starting with the prompt "The Eiffel Tower is " continues "324 meters high and weighs 7,300 tons. It was built in 1889 for the Universal Exhibition...

bug

How to get shortcut key for AIChat working in insert mode

I'm trying to get AIChat to work with a shortcut so that after typing in a chat message and pressing the hotkey in insert mode it would run AIChat. So...

Dynamic loading - different models at request time / multiple models

Instead of running an instance per model in the Dockerfile, can a list of models be provided at instantiation, with the model then chosen via the API request? The...

Llama 3

Given that we have only Llama 3 70B and 8B, it would be useful to have a Tiny Llama based on the Llama 3 tokenizer so that we can use...

[Bug]: FP8 Marlin fallback out of memory regression

### Your current environment Collecting environment information... PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS...

bug

Pass on additional sampling parameters to endpoint

There are many other parameters that are not passed through. Maybe arbitrary options can be passed through. Most important are stop tokens, early stopping, repetition etc. SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0,...

Qwen3 - thinking control

With Qwen3 you can dynamically control whether thinking is used or not. In some cases from the CLI, I want thinking to happen but take only the final output to...

enhancement

Binary quantization - evaluate quality

### Feature request Is there a way of receiving the embeddings back in BQ format? Right now, I receive the full precision embedding and quantize it in the client, but...
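For context, the client-side quantization mentioned above could look roughly like the sketch below. It assumes the common sign-based scheme (one bit per dimension, set when the value is positive, packed most significant bit first); the function names `binary_quantize` and `hamming_distance` are hypothetical and not taken from any particular server or library, and the actual BQ layout a server would return may differ.

```python
from typing import Sequence


def binary_quantize(embedding: Sequence[float]) -> bytes:
    """Pack the sign bits of a float embedding into bytes (1 bit per dimension).

    Assumption: a bit is 1 when the value is strictly positive, and bits are
    packed most-significant-bit first. Trailing bits of the last byte are 0.
    """
    out = bytearray((len(embedding) + 7) // 8)
    for i, value in enumerate(embedding):
        if value > 0:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equally sized packed embeddings."""
    if len(a) != len(b):
        raise ValueError("packed embeddings must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    full_precision = [0.5, -0.1, 0.0, 2.0, -3.0, 1.0, 1.0, -1.0]
    packed = binary_quantize(full_precision)
    print(packed.hex())
```

Comparing packed embeddings by Hamming distance is the usual companion to this scheme, which is why quality has to be evaluated against the full-precision ranking.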

Enterprise version details

The enterprise version has some additional features for auto-repair. Can you include more details on this in the wiki including whether this involves any changes to the on-disk format (if...

Extend support to compute capability 6.0/6.1

### Feature request Currently compile options allow specifying compute cap down to 75. Can Pascal generation cc 6.0/6.1 also be supported? ``` Dockerfile-cuda:50 -------------------- 49 | 50 | >>> RUN...