Ye Wang

Results: 3 issues by Ye Wang

### Description 1. Add a config key in run_options to control CUDA graph at runtime. 2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT...

I tried to replace SGemm() with CublasLtMatMul() for its multiple algorithm options, such as Tile, but found that CublasLtMatMul() is generally slower than Gemm(). Is this expected?...

cuBLASLt

### Description Conditionally route to a custom AllReduce kernel when the buffer size and the number of GPUs meet certain requirements. Otherwise, keep using NCCL's AllReduce. ### Motivation and Context
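
The routing rule described above can be illustrated with a minimal sketch. The description does not state the actual requirements, so the threshold value, the supported GPU counts, and the names `AllReducePolicy` and `choose_allreduce` are all hypothetical and chosen only for illustration; this is not the code from the pull request.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AllReducePolicy:
    # Hypothetical limits: the real requirements are not given in the description.
    max_custom_bytes: int = 8 * 1024 * 1024          # assumed upper bound on buffer size
    supported_gpu_counts: Tuple[int, ...] = (2, 4, 8)  # assumed eligible GPU counts


def choose_allreduce(buffer_bytes: int, num_gpus: int,
                     policy: AllReducePolicy = AllReducePolicy()) -> str:
    """Return "custom" when both conditions hold, otherwise fall back to "nccl"."""
    if buffer_bytes <= 0 or num_gpus <= 0:
        raise ValueError("buffer_bytes and num_gpus must be positive")
    use_custom = (
        buffer_bytes <= policy.max_custom_bytes
        and num_gpus in policy.supported_gpu_counts
    )
    return "custom" if use_custom else "nccl"
```

Keeping the decision in one small pure function makes the fallback explicit: any input that fails either check continues to use NCCL's AllReduce.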