Chris Sullivan issues

Results: 7 issues of Chris Sullivan

Support using GPU via CUDA

I am using standard gdb with code that has CUDA library calls. I don't need to debug the CUDA-specific code, but would like the functionality of rr for the...

Method overloading via interfaces and type mismatch

First off, allow me to compliment you on this nice Python library and f2py extension. After some work, I got it working for my library, but I did run into...

[TIR][CUDA] Add native FP8 support to codegen

Adds native FP8 type support for CUDA. The e4m3/e5m2 struct types provide explicit type conversions that target hardware-native conversion ops.
* Conditionally run Storage and Compute legalization for targets...

[Disco] Add MSCCLPP initialization alongside NCCL

* Use CCL type traits to share common code between NCCL and MSCCLPP API invocations in disco
* Add bench to validate results and compare various supported CCL approaches for...

[Blackwell] Support DescriptorLoadOp when deciding to use shared memory for scales

This ensures scales produced by TMA are eligible for transfer to tensor memory in later lowering. #### [PR chain](https://github.com/jlebar/git-pr-chain)...

[Bench][Blackwell] Fix warp specialization for fp8 x mxfp4 bench

Applies a bug fix for padded scale loads in fp8xmxfp4 mode, ensuring TMA load requirements are met when using the unpacked...

[Bench][Blackwell] Support optional scale TMAs in warp specialization for tl.dot_scaled

This enables automatic warp specialization for block-scaled workloads. #### [PR chain](https://github.com/jlebar/git-pr-chain) 1. #6535 1. 👉 #6536 👈 **YOU ARE...