AMD CPU generation is very slow
Very slow tokens/second in FP32. It feels worse than it should be, but I'm not entirely sure of the best way to debug it.
```
$ python3 torchchat.py generate --prompt "hello model" -v llama2
Using device=cpu AMD Ryzen 7 3700X 8-Core Processor
Loading model...
Time to load model: 2.35 seconds
tensor([    1, 22172,  1904], dtype=torch.int32)
hello model
[snip output]
Time for inference 1: 1043.69 sec total, 0.19 tokens/sec
Bandwidth achieved: 2.58 GB/s
Max Sequence Length Reached. Ending Conversation.
Average tokens/sec: 0
```
I will try it with a couple of other dtypes as well, but this feels outside the range of expectations. @malfet?
Well, that's MKL for you: it refuses to believe that AMD CPUs are capable of AVX512 instructions, even when they are. Try https://documentation.sigma2.no/jobs/mkl.html
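The linked page works around this with an `LD_PRELOAD` shim that makes MKL's CPU-vendor check succeed on AMD. As a rough Python-side alternative (a sketch, not torchchat code), MKL's dispatch can sometimes be nudged via environment variables, provided they are set before torch (and therefore MKL) is first imported. Whether either knob helps depends on the MKL version, so measure before and after:

```python
# Hedged sketch: influence MKL's ISA dispatch on AMD via environment
# variables. These must be set before torch/MKL is loaded for the
# first time, hence the imports below the os.environ calls.
#
# Caveats: MKL_DEBUG_CPU_TYPE was an undocumented knob removed around
# MKL 2020.1; MKL_ENABLE_INSTRUCTIONS is documented but sets a ceiling
# on the ISA rather than forcing a faster path on AMD.
import os

os.environ.setdefault("MKL_DEBUG_CPU_TYPE", "5")          # older MKL only
os.environ.setdefault("MKL_ENABLE_INSTRUCTIONS", "AVX2")  # newer MKL

import torch

# Sanity check: see which math library and threading setup torch picked up.
print(torch.__config__.parallel_info())
```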
ah so unless we force it, our AMD performance will just be painful?
Sadly, that is correct. We could determine which setting is better for AMD (AVX2 or AVX512) and, if we can ascertain which core types work with which settings, apply the override programmatically (see the sketch below). If that doesn't work, we should add a note to the docs that AMD appears slow.
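For the "can we do this programmatically" question, here is a minimal Linux-only sketch. The helper name is illustrative, not an existing torchchat function; it classifies the host CPU by vendor and AVX feature flags so a default could be chosen automatically:

```python
# Hypothetical helper: classify the host CPU so an AVX2-vs-AVX512 override
# could be picked automatically. Linux-only: parses /proc/cpuinfo; other
# platforms would need their own probe.
def cpu_isa_hint() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            info = f.read()
    except OSError:
        return "unknown"

    is_amd = "AuthenticAMD" in info

    flags = set()
    for line in info.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

    if not is_amd:
        return "default"          # let MKL decide on Intel
    if "avx512f" in flags:
        return "amd-avx512"
    if "avx2" in flags:
        return "amd-avx2"
    return "amd-baseline"


if __name__ == "__main__":
    print(cpu_isa_hint())  # e.g. "amd-avx2" on a Ryzen 7 3700X (Zen 2, no AVX512)
```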
As a second solution to this, the defaults are now `--dtype "fast"` and `--device "fast"`.
`--dtype fast` resolves to:
- float16 on mobile
- float16 on macOS < 14
- bfloat16 on macOS >= 14
- bfloat16 on other systems (in particular, Linux)

`--device fast` resolves to: CUDA if available; otherwise MPS if available; otherwise CPU. A sketch of both resolutions follows below.
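For reference, a hedged sketch of how those "fast" defaults could resolve; the actual torchchat implementation may differ, and the mobile case is omitted since this is desktop-oriented:

```python
# Sketch of the "fast" resolution rules described above; illustrative only.
import platform

import torch


def fast_device() -> str:
    # Prefer CUDA, then Apple MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def fast_dtype() -> torch.dtype:
    # float16 on macOS < 14, bfloat16 on macOS >= 14 and everywhere else.
    if platform.system() == "Darwin":
        ver = platform.mac_ver()[0]  # e.g. "14.4.1"
        major = int(ver.split(".")[0]) if ver else 0
        return torch.bfloat16 if major >= 14 else torch.float16
    return torch.bfloat16  # other systems, in particular Linux


print(fast_device(), fast_dtype())
```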
So, what are we going to do about this? Document it in the release notes, or automatically set the override variable?