James Whedbee
I am eagerly awaiting this too. Is there any area where contributions would be welcomed to help merge this?
Running into this as well.
I am seeing this too using `CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers`
I was going to try this out soon — is this in a good spot, or is it still being worked on?
Yep, tensor parallelism worked for me with no code changes! I'll try using an unquantized Llama 2 70B tomorrow as the verifier model, since int4 quantization is not supported on AMD...
@Chillee Maybe I misunderstood — could you give me an example command you think should result in a speed-up? I can get ~15 tokens/second for an unquantized Llama 70B using compile...
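For reference, the ~15 tokens/second figure I'm quoting comes from timing generation end-to-end. A generic sketch of how decode throughput can be measured (the `generate_fn` callable is a placeholder for whatever generation entry point you're benchmarking — it's assumed here to return the number of new tokens produced):

```python
import time

def tokens_per_second(generate_fn, prompt, n_runs=3):
    """Rough decode-throughput measurement.

    generate_fn(prompt) is assumed to run one full generation and
    return the count of newly generated tokens. Averages over
    n_runs to smooth out warm-up and scheduling noise.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return sum(rates) / len(rates)
```

For compiled models it's worth discarding the first call (compilation happens there), so the averaged runs reflect steady-state decode speed.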
@Chillee That unfortunately also just results in ~8 tokens/second. EDIT: just saw your edit.
Hey, @Chillee were you able to learn more about the issue here?
I am at the third bullet point here as well; going to just follow along with the comments here.
That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause. I can't find anything online related...