James Whedbee
I am eagerly awaiting this too. Is there any area where contributions would be welcomed to help merge this?
Running into this as well.
I am seeing this too using `CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers`
I was going to try this out soon — is this in a good spot, or is it still being worked on?
Yep, tensor parallelism worked for me with no code changes! I'll try using an unquantized Llama 2 70B tomorrow as the verifier model, since int4 quantization is not supported on AMD...
@Chillee Maybe I misunderstood — could you give me an example command you think should result in a speed-up? I can get ~15 tokens/second for an unquantized Llama 70B using compile...
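For reference, the ~15 tokens/second figure I'm quoting comes from timing generation end-to-end. A generic sketch of how decode throughput can be measured (the `generate_fn` callable is a placeholder for whatever generation entry point you're benchmarking — it's assumed here to return the number of new tokens produced):

```python
import time

def tokens_per_second(generate_fn, prompt, n_runs=3):
    """Rough decode-throughput measurement.

    generate_fn(prompt) is assumed to run one full generation and
    return the count of newly generated tokens. Averages over
    n_runs to smooth out warm-up and scheduling noise.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return sum(rates) / len(rates)
```

For compiled models it's worth discarding the first call (compilation happens there), so the averaged runs reflect steady-state decode speed.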
@Chillee That unfortunately also just results in ~8 tokens/second. EDIT: just saw your edit.
Hey, @Chillee were you able to learn more about the issue here?
I am at the third bullet point here as well; going to just follow along with the comments here.
That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause. I can't find anything online related...