Ye Wang
### Description 1. Add a config key in run_options to control CUDA graph at runtime. 2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT...
I tried replacing SGemm() with CublasLtMatMul() because it offers multiple algorithm choices, such as Tile, but found that CublasLtMatMul() is generally slower than Gemm(). Is this expected?...
### Description Conditionally route to a custom AllReduce kernel when the buffer size and the number of GPUs meet certain requirements. Otherwise, keep using NCCL's AllReduce. ### Motivation and Context