
Feature request: MPS backend from liuliu's codebase

Open brappier opened this issue 2 years ago • 7 comments

Check out @liuliu's NNC; maybe someone should use his MPS backend in tinygrad. It's much, much faster and better written.

brappier avatar Mar 07 '23 23:03 brappier

https://github.com/liuliu/s4nnc

brappier avatar Mar 07 '23 23:03 brappier

Why is it faster? tinygrad should soon be able to saturate the hardware for common convs and matmuls, and for some models it's already generating the minimal number of kernels.

Minimal kernels + saturated reduces = max speed. That said, the libnnc docs are a neat read, much better than tinygrad's.

geohot avatar Mar 08 '23 17:03 geohot

Firstly, using BHWC is faster on Apple hardware than BCHW; I don't know if tinygrad is using that. Otherwise I don't know, I haven't investigated it.

brappier avatar Mar 09 '23 04:03 brappier
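The layout point above can be sketched in a few lines of Python. This is standard stride arithmetic, not tinygrad-specific code: in BHWC (channels-last), the channel values of one pixel sit at adjacent flat offsets, which suits per-pixel vectorized loads; in BCHW they are H*W elements apart.

```python
# Sketch: flat-index strides for the two layouts.
# flat index (BHWC) = b*H*W*C + h*W*C + w*C + c
# flat index (BCHW) = b*C*H*W + c*H*W + h*W + w

def strides_bhwc(B, H, W, C):
    return (H * W * C, W * C, C, 1)      # channel stride is 1: contiguous

def strides_bchw(B, C, H, W):
    return (C * H * W, H * W, W, 1)      # channel stride is H*W: strided

B, C, H, W = 1, 4, 8, 8
print(strides_bhwc(B, H, W, C)[3])  # 1  -> next channel is the next element
print(strides_bchw(B, C, H, W)[1])  # 64 -> next channel is 64 elements away
```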

Convs are implemented as HLOPs; even at the derivative layer there's no CONV op. It's a quick change to use BHWC.

For example, for LLaMA, tinygrad's backend is 3x faster than the shapetracking in Python! So changing the backend would change nothing. (Now... why the Python is so slow is a different story.)

Do you have a specific case where the backend is slow?

geohot avatar Mar 10 '23 04:03 geohot
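The "convs as HLOPs" idea above can be illustrated in plain Python: a convolution built only from a movement op (slicing out windows) plus elementwise multiply and sum, with no dedicated CONV primitive. This is a hedged toy sketch of the composition style, not tinygrad's actual implementation or API.

```python
# Toy 1D "conv" (really cross-correlation) composed from simpler ops:
#   1. movement op: slice out sliding windows
#   2. elementwise multiply
#   3. reduce: sum each window
def conv1d(x, w):
    k = len(w)
    windows = [x[i:i + k] for i in range(len(x) - k + 1)]  # movement op
    return [sum(a * b for a, b in zip(win, w)) for win in windows]  # mul + sum

print(conv1d([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```

Because the whole thing is movement ops plus a multiply-reduce, swapping the memory layout underneath (BCHW vs BHWC) only changes how the windows are gathered, not the op set, which is why the layout switch is cheap.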

I'll add some benchmark scripts and maybe send a PR. But you can compare with NNC's Stable Diffusion on the lowest-end 8GB M1 Mac.

https://github.com/liuliu/swift-diffusion this literally runs at about 1.3 s per iteration with FP16 and 1.8 s with FP32; tinygrad can't compete with that.

It could be the memory management.

brappier avatar Mar 10 '23 15:03 brappier
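For the benchmark scripts mentioned above, a minimal per-iteration timing harness is all that's needed to reproduce numbers like "1.3 s per iteration". Here is a hedged sketch; `step` stands in for one sampling iteration of whatever model is being measured (hypothetical, not an actual tinygrad or NNC API).

```python
import time

def benchmark(step, iters=10, warmup=2):
    """Return mean wall-clock seconds per call of `step`.

    Warmup iterations are run first so one-time costs (kernel
    compilation, memory allocation) don't skew the mean.
    """
    for _ in range(warmup):
        step()
    t0 = time.perf_counter()
    for _ in range(iters):
        step()
    return (time.perf_counter() - t0) / iters

# usage (model_step is hypothetical):
#   print(f"{benchmark(lambda: model_step(latents)):.3f} s/iter")
```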

I think there are different objectives. While tinygrad is capable of integrating with other backends at any level (either HLOPs or LLOPs), they may want to reduce the exposed surface (i.e. maintenance burden) for a particular platform. Thus, if they want to support Apple GPUs, it is probably more reasonable for them to go the Metal compute shader route (which can also lend itself to Vulkan if applicable).

My understanding is that tinygrad occupies an interesting space. On one end of the spectrum there are TF and PyTorch, which take the kitchen-sink approach: any level of backend goes, as long as it provides an approachable interface for end users. On the other end are MLIR and TVM, which deal directly with scheduling at the lowest hardware level. tinygrad tries to provide an approachable end-user experience while also, by cleverly splitting ops (much like Jittor, or now PrimTorch), being able to exploit the underlying hardware to its limit with limited maintenance burden.

To that extent, MPS integration is probably too high-level for tinygrad, as it would have to give up a bunch of optimization opportunities that it wouldn't if it went directly to Metal.

liuliu avatar Mar 10 '23 16:03 liuliu

I agree! But at the same time, it's fun to have real speed benchmarks of real-world models. It will motivate a lot of users to try out the platform.

brappier avatar Mar 10 '23 16:03 brappier