Feature request: MPS backend from liuliu's codebase
Check out @liuliu's NNC; maybe someone should use his MPS backend in tinygrad. It's much, much faster and better written.
https://github.com/liuliu/s4nnc
Why is it faster? tinygrad should soon be able to saturate the hardware for common convs and matmuls, and for some models it's already generating the minimal number of kernels.
Minimal kernels + saturated reduces = max speed. Though I'll grant the libnnc docs are neat to read, much better than tinygrad's.
Firstly, using BHWC (channels-last) is faster on Apple hardware than BCHW. idk if tinygrad is using that; beyond that, I haven't investigated it.
Convs are implemented as HLOPs; even at the derivative layer there's no CONV op, so it would be a quick change to use BHWC.
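To make the layout point concrete, here's a hypothetical pure-Python sketch (not tinygrad code) of what switching from BCHW to BHWC means for a flat buffer: the same values, but with channels interleaved per pixel, which tends to match Apple GPUs' preferred memory access pattern. The function name and shapes are illustrative only.

```python
def nchw_to_nhwc(buf, n, c, h, w):
    """Reorder a flat BCHW (NCHW) buffer into BHWC (NHWC) order.

    Pure-Python sketch for illustration; a real backend would do this
    with a strided view or a permute, not an element-wise copy.
    """
    out = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi   # NCHW offset
                    dst = ((ni * h + hi) * w + wi) * c + ci   # NHWC offset
                    out[dst] = buf[src]
    return out

# 1x2x2x2 example: the two channels end up interleaved per pixel
x = list(range(8))
print(nchw_to_nhwc(x, 1, 2, 2, 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Since convs are HLOPs built from movement ops, a layout change like this is a permute at the frontend rather than a rewrite of any conv kernel.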
For example, for LLaMA, tinygrad's backend is 3x faster than the shapetracking in Python that feeds it! So changing the backend would change nothing. (Now... why the Python is so slow is a different story.)
Do you have a specific case where the backend is slow?
I'll add some benchmark scripts and maybe send a PR. But you can compare with NNC's Stable Diffusion on the lowest-end 8GB M1 Mac.
https://github.com/liuliu/swift-diffusion This literally runs at about 1.3 s per iteration with FP16 and 1.8 s with FP32; tinygrad can't compete with that.
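For comparisons like this, a fair benchmark needs warmup runs (to exclude compilation/JIT and cache effects) before timing. A minimal sketch of what such a script might look like; `bench` and the `step` placeholder are hypothetical names, not from either repo:

```python
import time

def bench(fn, warmup=3, iters=10):
    """Return mean wall-clock seconds per call of fn(), after warmup runs."""
    for _ in range(warmup):
        fn()                          # warm caches / trigger any JIT first
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# usage sketch: step would be one denoising iteration of the model under test
# print(f"{bench(step):.3f} s/iter")
```

Per-iteration means over a fixed iteration count make numbers like "1.3 s/iter FP16 vs 1.8 s/iter FP32" directly comparable across backends.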
It could be the memory management.
I think there are different objectives. While tinygrad is capable of integrating with other backends at any level (either HLOPs or LLOPs), they may want to reduce the exposed surface (i.e. maintenance burden) for a particular platform. Thus, if they want to support Apple GPUs, it is probably more reasonable for them to go the Metal compute shader route (which can also lend itself to Vulkan if applicable).
My understanding is that tinygrad occupies an interesting space. On one end of the spectrum there are TF and PyTorch, which take a kitchen-sink approach: any level of backend goes, as long as it provides an approachable interface for end users. The other end is MLIR or TVM, which deal strictly with scheduling at the lowest hardware level. tinygrad tries to provide an approachable end-user experience while also, by cleverly splitting ops (much like Jittor, or now PrimTorch), being able to exploit the underlying hardware to its limit with a limited maintenance burden.
To that extent, MPS integration is probably too high-level for tinygrad, as it would mean giving up a bunch of optimization opportunities that wouldn't be lost by going directly to Metal.
I agree! But at the same time, it's fun to have real speed benchmarks of real-world models. It would motivate a lot of users to try out the platform.