Kevin Pham

20 comments by Kevin Pham

Maybe consider supporting the QuaRot quantization scheme?

> [QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456)
>
> We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs...

Interesting, would pretraining on Mixtral-8x22B also be possible?

Usually, specifying a punctuation stop token is sufficient. I've found that leading the model with an `'` is enough to make it want to close the quote, and it works well....
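The approach above can be sketched in plain Python; the `generate` callable and the stop character here are stand-ins for whatever model API is actually in use:

```python
def complete_quote(generate, prompt, stop="'"):
    """Prime the completion with an opening quote, then cut at the
    punctuation stop token, emulating a server-side stop sequence."""
    primed = prompt + " '"      # leading quote nudges the model to close it
    text = generate(primed)
    end = text.find(stop)       # emulate the stop-token cutoff
    return text[:end] if end != -1 else text

# stand-in for a real model call
fake_generate = lambda p: "hello there' and more rambling"
complete_quote(fake_generate, "He said")  # → "hello there"
```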

It depends on the model, I think. Sometimes, if you have no idea how long it is, you could try starting the prompt off in XML tags, and in the instructions...
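A concrete sketch of the XML-tag idea (the tag name and stop string here are hypothetical; most completion APIs accept custom stop sequences):

```python
# Prefill the opening tag so the model's natural next move is to fill it
# and then close it; stopping on the closing tag bounds the output length.
prompt = (
    "Summarize the passage below. Put the summary inside <summary> tags.\n\n"
    "<summary>"  # prefilled opening tag
)
stop_sequences = ["</summary>"]  # pass via the API's stop/stop_sequences parameter
```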

> ```
> ValueError: Must flatten tensors with uniform dtype but got torch.float32 and torch.float16
> ```

I have the same issue too.

Actually, I thought about it a bit more. Perhaps the best way is to implement a custom LogitsProcessor for vLLM, which does this function calling by hijacking the logits at...
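A minimal sketch of what such a processor could look like, using the vLLM-style `(token_ids, logits) -> logits` callable signature. The trigger id, forced token ids, and plain-list logits are illustrative stand-ins for a real tokenizer and real tensors:

```python
class ForceFunctionCallProcessor:
    """Sketch: once a trigger token appears, mask the logits so the model
    is forced to emit a fixed tool-call token sequence."""

    def __init__(self, trigger_id, forced_ids):
        self.trigger_id = trigger_id
        self.forced = list(forced_ids)  # tokens spelling the tool-call prefix
        self.active = False

    def __call__(self, token_ids, logits):
        # arm once the trigger token has been generated
        if token_ids and token_ids[-1] == self.trigger_id:
            self.active = True
        if self.active and self.forced:
            next_id = self.forced.pop(0)
            # mask everything except the forced token
            logits = [float("-inf")] * len(logits)
            logits[next_id] = 0.0
        return logits

proc = ForceFunctionCallProcessor(trigger_id=5, forced_ids=[2, 3])
out = proc([1, 5], [0.1] * 6)       # trigger seen: only token 2 survives
out.index(max(out))                 # → 2
```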

Yes I would be interested in contributing. Traditional function calling is usually done with the vLLM [LLM.chat()](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py#L618C1-L632C30) [calling semantics](https://docs.vllm.ai/en/latest/features/tool_calling.html). But we could leave this up to the user by letting...

After doing a bit of digging, perhaps this API design enabling multi-turn tool-calling interaction is not feasible from a performance perspective. Here's why:

> Does the chat API support...

I don’t think my proposal is feasible… it’s fast, but too cumbersome to work with. Check out the work being done in TRL, though: https://github.com/huggingface/trl/pull/2810

> [@accupham](https://github.com/accupham) can you elaborate on "too cumbersome to work with"? thanks.

It’s difficult to work with raw tokens to implement tool calling via a logits processor. Token boundaries are very...
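A toy illustration of the token-boundary problem (the vocabulary and trigger string here are made up): a string-level trigger only becomes visible once every token that spells it has arrived, so logit-level interception has to reason below the string level.

```python
def naive_detect(tokens, needle, decode):
    """Sketch: detect a tool-call trigger string in a token stream.
    Fragile, because the trigger may straddle token boundaries."""
    text = decode(tokens)
    return needle in text

# hypothetical vocabulary where the trigger spans two tokens
vocab = {0: "<tool", 1: "_call>", 2: "hello"}
decode = lambda ids: "".join(vocab[i] for i in ids)

naive_detect([0], "<tool_call>", decode)     # → False: trigger incomplete mid-stream
naive_detect([0, 1], "<tool_call>", decode)  # → True only after both tokens arrive
```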