Phil
Really excited about optimized kernels for inference! Worth looking at https://github.com/zeux/calm, where the forward pass is implemented as a single CUDA kernel. It uses fp8 rather than int4/int8 quantization.
That's an interesting idea and worth experimenting with. My intuition is that it would be too generic and would be difficult to get working reliably.
Thanks! I don't think we should have the agent install dependencies, though. How about adding a top-level requirements.txt?
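For context, a top-level requirements.txt is just a newline-separated list of pinned dependencies that users install once with `pip install -r requirements.txt`. A minimal sketch (the package names and version pins below are purely illustrative, not taken from this project):

```
# requirements.txt (illustrative entries only)
requests>=2.31,<3
pyyaml~=6.0
```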
How about https://pypi.org/project/retry/?
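The retry package exposes a decorator that re-runs a function when it raises. As a self-contained sketch of the pattern it provides (a hand-rolled equivalent for illustration, not the library's actual implementation):

```python
import functools
import time


def retry(exceptions=Exception, tries=3, delay=0):
    """Re-run the wrapped function up to `tries` times, sleeping
    `delay` seconds between attempts; re-raise on the final failure.
    A stripped-down stand-in for the `retry` package's decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise  # out of attempts: propagate the error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = []


@retry(exceptions=ValueError, tries=3)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return "ok"


print(flaky())  # succeeds on the third attempt
```

With the real package this would just be `from retry import retry`, keeping the calling code identical.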
+1, much-needed feature