Chris Sullivan
I am using standard gdb with code that has CUDA library calls. I don't need to debug the CUDA specific code, but would like the functionality of rr for the...
First off, allow me to compliment you on this nice python library and f2py extension. After some work, I got it working for my library, but I did run into...
Adds native FP8 type support for CUDA. The e4m3/e5m2 struct types provide explicit type conversions that target hardware native conversion ops. * Conditionally run Storage and Compute legalization for targets...
* Use CCL type traits to share common code between NCCL and MSCCLPP API invocations in disco * Add bench to validate results and compare various supported CCL approaches for...
[Blackwell] Support DescriptorLoadOp when deciding to use shared memory for scales This ensures scales produced by TMA are eligible for transfer to tensor memory in later lowering. #### [PR chain](https://github.com/jlebar/git-pr-chain)...
[Bench][Blackwell] Fix warp specialization for fp8 x mxfp4 bench Applies a bug fix for padded scale loads in fp8xmxfp4 mode, ensuring TMA load requirements are met when using the unpacked...
[Bench][Blackwell] Support optional scale TMAs in warp specialization for tl.dot_scaled This enables automatic warp specialization for block scaled workloads. #### [PR chain](https://github.com/jlebar/git-pr-chain) 1. #6535 1. 👉 #6536 👈 **YOU ARE...