AMD CPU generation is very slow
Very slow tokens/second in FP32. It feels worse than it should be, but I'm not entirely sure of the best way to debug it.
```
$ python3 torchchat.py generate --prompt "hello model" -v llama2
Using device=cpu AMD Ryzen 7 3700X 8-Core Processor
Loading model...
Time to load model: 2.35 seconds
tensor([    1, 22172,  1904], dtype=torch.int32)
hello model
[snip output]
Time for inference 1: 1043.69 sec total, 0.19 tokens/sec
Bandwidth achieved: 2.58 GB/s
Max Sequence Length Reached. Ending Conversation.
Average tokens/sec: 0
```
I will try it with a couple of other dtypes as well, but this feels outside the range of expectations. @malfet?
Well, that's MKL for you: it refuses to believe that AMD CPUs are capable of AVX512 instructions, even when they are. Try https://documentation.sigma2.no/jobs/mkl.html
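The linked page works around this with an `LD_PRELOAD` shim that makes MKL's CPU-vendor check succeed on AMD. As a rough Python-side alternative (a sketch, not torchchat code), MKL's dispatch can sometimes be nudged via environment variables, provided they are set before torch (and therefore MKL) is first imported. Whether either knob helps depends on the MKL version, so measure before and after:

```python
# Hedged sketch: influence MKL's ISA dispatch on AMD via environment
# variables. These must be set before torch/MKL is loaded for the
# first time, hence the imports below the os.environ calls.
#
# Caveats: MKL_DEBUG_CPU_TYPE was an undocumented knob removed around
# MKL 2020.1; MKL_ENABLE_INSTRUCTIONS is documented but sets a ceiling
# on the ISA rather than forcing a faster path on AMD.
import os

os.environ.setdefault("MKL_DEBUG_CPU_TYPE", "5")          # older MKL only
os.environ.setdefault("MKL_ENABLE_INSTRUCTIONS", "AVX2")  # newer MKL

import torch

# Sanity check: see which math library and threading setup torch picked up.
print(torch.__config__.parallel_info())
```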
ah so unless we force it, our AMD performance will just be painful?
Sadly, that is correct. We could determine which setting is better for AMD (AVX2 or AVX512) and, if we can ascertain which core types work with which settings, apply the override programmatically (see the sketch below). If that doesn't work, we should add a note to the docs that AMD appears slow.
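For the "can we do this programmatically" question, here is a minimal Linux-only sketch. The helper name is illustrative, not an existing torchchat function; it classifies the host CPU by vendor and AVX feature flags so a default could be chosen automatically:

```python
# Hypothetical helper: classify the host CPU so an AVX2-vs-AVX512 override
# could be picked automatically. Linux-only: parses /proc/cpuinfo; other
# platforms would need their own probe.
def cpu_isa_hint() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            info = f.read()
    except OSError:
        return "unknown"

    is_amd = "AuthenticAMD" in info

    flags = set()
    for line in info.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

    if not is_amd:
        return "default"          # let MKL decide on Intel
    if "avx512f" in flags:
        return "amd-avx512"
    if "avx2" in flags:
        return "amd-avx2"
    return "amd-baseline"


if __name__ == "__main__":
    print(cpu_isa_hint())  # e.g. "amd-avx2" on a Ryzen 7 3700X (Zen 2, no AVX512)
```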
As a second solution to this, the defaults are now `--dtype "fast"` and `--device "fast"`.
`--dtype fast` resolves to:
- float16 on mobile
- float16 on macOS < 14
- bfloat16 on macOS >= 14
- bfloat16 on other systems (in particular, Linux)

`--device fast` resolves to: CUDA if available; otherwise MPS if available; otherwise CPU. A sketch of both resolutions follows below.
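For reference, a hedged sketch of how those "fast" defaults could resolve; the actual torchchat implementation may differ, and the mobile case is omitted since this is desktop-oriented:

```python
# Sketch of the "fast" resolution rules described above; illustrative only.
import platform

import torch


def fast_device() -> str:
    # Prefer CUDA, then Apple MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def fast_dtype() -> torch.dtype:
    # float16 on macOS < 14, bfloat16 on macOS >= 14 and everywhere else.
    if platform.system() == "Darwin":
        ver = platform.mac_ver()[0]  # e.g. "14.4.1"
        major = int(ver.split(".")[0]) if ver else 0
        return torch.bfloat16 if major >= 14 else torch.float16
    return torch.bfloat16  # other systems, in particular Linux


print(fast_device(), fast_dtype())
```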
So, what are we going to do about this? Document it in the release notes, or automatically set the override variable?