Kevin Pham

20 comments by Kevin Pham

Maybe consider supporting the QuaRot quantization scheme?

> [QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456)
>
> We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs...

Interesting, would pretraining on Mixtral-8x22B also be possible?

Usually, specifying a punctuation stop token is sufficient. I've found that leading the model with an `'` is enough to make it want to close the quote, and it works well....
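The approach above can be sketched in plain Python; the `generate` callable and the stop character here are stand-ins for whatever model API is actually in use:

```python
def complete_quote(generate, prompt, stop="'"):
    """Prime the completion with an opening quote, then cut at the
    punctuation stop token, emulating a server-side stop sequence."""
    primed = prompt + " '"      # leading quote nudges the model to close it
    text = generate(primed)
    end = text.find(stop)       # emulate the stop-token cutoff
    return text[:end] if end != -1 else text

# stand-in for a real model call
fake_generate = lambda p: "hello there' and more rambling"
complete_quote(fake_generate, "He said")  # → "hello there"
```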

It depends on the model, I think. Sometimes, if you have no idea how long it is, you could try starting the prompt off in XML tags, and in the instructions...
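A concrete sketch of the XML-tag idea (the tag name and stop string here are hypothetical; most completion APIs accept custom stop sequences):

```python
# Prefill the opening tag so the model's natural next move is to fill it
# and then close it; stopping on the closing tag bounds the output length.
prompt = (
    "Summarize the passage below. Put the summary inside <summary> tags.\n\n"
    "<summary>"  # prefilled opening tag
)
stop_sequences = ["</summary>"]  # pass via the API's stop/stop_sequences parameter
```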

> ```
> ValueError: Must flatten tensors with uniform dtype but got torch.float32 and torch.float16
> ```

I have the same issue too.

Actually, I thought about it a bit more. Perhaps the best way is to implement a custom LogitsProcessor for vLLM, which does this function calling by hijacking the logits at...
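A minimal sketch of what such a processor could look like, using the vLLM-style `(token_ids, logits) -> logits` callable signature. The trigger id, forced token ids, and plain-list logits are illustrative stand-ins for a real tokenizer and real tensors:

```python
class ForceFunctionCallProcessor:
    """Sketch: once a trigger token appears, mask the logits so the model
    is forced to emit a fixed tool-call token sequence."""

    def __init__(self, trigger_id, forced_ids):
        self.trigger_id = trigger_id
        self.forced = list(forced_ids)  # tokens spelling the tool-call prefix
        self.active = False

    def __call__(self, token_ids, logits):
        # arm once the trigger token has been generated
        if token_ids and token_ids[-1] == self.trigger_id:
            self.active = True
        if self.active and self.forced:
            next_id = self.forced.pop(0)
            # mask everything except the forced token
            logits = [float("-inf")] * len(logits)
            logits[next_id] = 0.0
        return logits

proc = ForceFunctionCallProcessor(trigger_id=5, forced_ids=[2, 3])
out = proc([1, 5], [0.1] * 6)       # trigger seen: only token 2 survives
out.index(max(out))                 # → 2
```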

Yes I would be interested in contributing. Traditional function calling is usually done with the vLLM [LLM.chat()](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py#L618C1-L632C30) [calling semantics](https://docs.vllm.ai/en/latest/features/tool_calling.html). But we could leave this up to the user by letting...

After doing a bit of digging, perhaps this API design enabling multi-turn tool-calling interaction is not feasible from a performance perspective. Here's why:

> Does the chat API support...

I don’t think my proposal is feasible… it’s fast, but too cumbersome to work with. Check out the work being done in TRL, though: https://github.com/huggingface/trl/pull/2810

> [@accupham](https://github.com/accupham) can you elaborate on "too cumbersome to work with"? thanks.

It’s difficult to work with raw tokens to implement tool calling via a logits processor. Token boundaries are very...
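A toy illustration of the token-boundary problem (the vocabulary and trigger string here are made up): a string-level trigger only becomes visible once every token that spells it has arrived, so logit-level interception has to reason below the string level.

```python
def naive_detect(tokens, needle, decode):
    """Sketch: detect a tool-call trigger string in a token stream.
    Fragile, because the trigger may straddle token boundaries."""
    text = decode(tokens)
    return needle in text

# hypothetical vocabulary where the trigger spans two tokens
vocab = {0: "<tool", 1: "_call>", 2: "hello"}
decode = lambda ids: "".join(vocab[i] for i in ids)

naive_detect([0], "<tool_call>", decode)     # → False: trigger incomplete mid-stream
naive_detect([0, 1], "<tool_call>", decode)  # → True only after both tokens arrive
```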