5 comments by Phil

Really excited about optimized kernels for inference! Worth looking at https://github.com/zeux/calm, where the forward pass is implemented as a single CUDA kernel. It uses fp8 rather than int4/int8 quantization.

That's an interesting idea and worth experimenting with. My intuition is that it would be too generic and difficult to get working reliably.

Thanks! I don't think we should have the agent install dependencies, though. How about adding a top-level requirements.txt?
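
For illustration, a top-level requirements.txt is just a newline-separated list of pinned packages (the package choices and version numbers here are hypothetical, not from the project):

```text
# requirements.txt — install with: pip install -r requirements.txt
retry==0.9.2        # mentioned below as a candidate for retry logic
requests>=2.31,<3   # hypothetical example of a range pin
```

Pinning exact versions keeps the agent's environment reproducible without letting it install arbitrary packages at runtime.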

How about https://pypi.org/project/retry/?
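
For context, the `retry` package provides a decorator that re-invokes a function on a given exception, with configurable tries, delay, and backoff. A minimal stdlib-only sketch of that pattern (my own stand-in, not the package's actual source):

```python
import functools
import time


def retry(exceptions=Exception, tries=3, delay=0.01, backoff=2):
    """Re-run the wrapped function on `exceptions`, up to `tries` attempts,
    sleeping `delay` seconds between attempts and multiplying by `backoff`.
    Mirrors the shape of the decorator in the `retry` PyPI package."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: propagate the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


calls = {"n": 0}


@retry(exceptions=ValueError, tries=3)
def flaky():
    # Fails on the first two calls, succeeds on the third.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return "ok"


result = flaky()
print(result, calls["n"])  # succeeds on the third attempt
```

The real package's decorator takes similarly named parameters (`exceptions`, `tries`, `delay`, `backoff`), so swapping this sketch for `from retry import retry` should be mechanical.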