cutile-python
[Question] Clarification on FP8 Micro-block Scaling and FP4 Support Timeline
Hi cuTile team,
I have two specific questions regarding the support for Blackwell-specific hardware features:
- **Automatic Micro-block Scaling for FP8:** When using `fp8` with `ct.matmul`, how is the (1x16) micro-block scaling handled?
  - Automation: Does the Tile IR compiler automatically handle the scaling logic and the 5th-gen Tensor Core invocation under the hood?
  - Explicit scaling: If it is not fully automatic, how should we provide the scale-factor tiles to the `ct.matmul` operator? Currently, the `ct.matmul(A, B)` signature seems to accept only data tiles. Is there a plan for a signature like `ct.matmul(A, B, A_scale, B_scale)` (see the sketch after this list)?
- **NVFP4 (FP4) Support Roadmap:** The current documentation and samples focus on `fp8` and `bf16`. Since Blackwell's peak throughput is tied to NVFP4, when can we expect support for 4-bit narrow-precision tiles in cuTile Python?
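To make the explicit-scaling question concrete, here is roughly what I have in mind. This is only a sketch: the `import cutile as ct` alias and the `ct.matmul(A, B, A_scale, B_scale)` signature are my own guesses based on the `ct.` prefix in the samples, not existing API.

```python
import cutile as ct  # assumed import alias; the samples use the `ct.` prefix


def fp8_scaled_matmul(A, B, A_scale, B_scale):
    # What seems possible today: only data tiles are passed, so any
    # 1x16 micro-block scaling would have to be handled implicitly
    # by the compiler.
    C = ct.matmul(A, B)

    # Hypothetical signature I am asking about (does not exist today):
    # pass the per-micro-block scale-factor tiles explicitly so the
    # 5th-gen Tensor Core scaled-MMA path can be targeted directly.
    # C = ct.matmul(A, B, A_scale, B_scale)
    return C
```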
Thanks for this great library!