Results 7 issues of rocking

In the file device_gemm_add_add_fastgelu_xdl_c_shuffle_f16_f16_f16_f16_f16_mk_nk_mn_mn_mn_instance.cpp, add the following instance: `DeviceGemmMultipleD_Xdl_CShuffle< Row, Col, Row_Row_Tuple, Row, F16, F16, F32, F32, F16_F16_Tuple, F16, PassThrough, PassThrough, AddAddFastGelu, GemmDefault, 1, 512, 128, 128, 32, 8, 8, 32,...

Under Investigation

We only support 2D (DeviceBinaryElementWise2D) so far. Support more dimensions for flexibility.
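One way to picture the N-dimensional generalization requested above is a binary elementwise operation driven by per-dimension lengths and strides. The sketch below is only an illustration in plain Python, not the Composable Kernel API: the function `elementwise_nd` and its parameters are hypothetical names, and a stride of 0 is assumed to mean "broadcast along that dimension".

```python
from itertools import product
import operator


def elementwise_nd(a, b, lengths, a_strides, b_strides, op=operator.add):
    """Apply `op` elementwise over an N-D index space (hypothetical helper).

    `a` and `b` are flat buffers; `lengths` gives the size of each dimension,
    and `a_strides` / `b_strides` give the element offset per step in each
    dimension. The result is returned as a flat, row-major list.
    """
    out = []
    # Visit every multi-index in row-major order, for any number of dimensions.
    for idx in product(*(range(n) for n in lengths)):
        a_off = sum(i * s for i, s in zip(idx, a_strides))
        b_off = sum(i * s for i, s in zip(idx, b_strides))
        out.append(op(a[a_off], b[b_off]))
    return out


# 3-D example: a is a packed 2x3x2 tensor, b is a length-2 vector
# broadcast over the first two dimensions (stride 0).
a = list(range(12))
b = [100, 200]
result = elementwise_nd(a, b, (2, 3, 2), (6, 2, 1), (0, 0, 1))
```

Because the loop is written over a tuple of lengths rather than two fixed dimensions, the 2D case is just `lengths` of size two.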

code quality

- This PR implements the AMD / ROCm version of the C++ flash API:
  1. mha_fwd
  2. mha_varlen_fwd
  3. mha_bwd
  4. mha_varlen_bwd
- The kernel implementation comes from [composable kernel](https://github.com/ROCm/composable_kernel)
-...

### Problem Description

This issue happens in cpp_extension. I use PyTorch's cpp_extension to compile CK instead of regular CMake. When we cast from __half to _Float16 if to compile...

1. Add a one-pass pipeline; switch between the one-pass and two-pass pipelines according to problem size
2. Fix compile error
3. Support padding
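The first item, choosing a pipeline by problem size, can be sketched as a simple dispatch. The summary does not name the kernel, so this assumes a row-wise mean/variance reduction purely for illustration. The cutoff value and all function names are hypothetical, and the one-pass variant here uses Welford's update as a stand-in for whatever the actual kernel does.

```python
# Hypothetical cutoff; a real kernel would derive it from hardware limits.
ONE_PASS_MAX_ROW_LEN = 1024


def select_pipeline(row_len, cutoff=ONE_PASS_MAX_ROW_LEN):
    """Pick a pipeline name from the problem size (row length)."""
    return "one_pass" if row_len <= cutoff else "two_pass"


def mean_var_one_pass(row):
    """Read each element once, updating mean and M2 (Welford's method)."""
    mean = 0.0
    m2 = 0.0
    for count, x in enumerate(row, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / len(row)


def mean_var_two_pass(row):
    """Read the row twice: first for the mean, then for the variance."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return mean, var


def mean_var(row, cutoff=ONE_PASS_MAX_ROW_LEN):
    """Dispatch to one of the two pipelines; expects a non-empty row."""
    if select_pipeline(len(row), cutoff) == "one_pass":
        return mean_var_one_pass(row)
    return mean_var_two_pass(row)
```

Both paths return the population variance, so callers get the same result regardless of which pipeline the size check selects.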

## Proposed changes

1. Simpler kernel example for layernorm
2. Use store_tile_raw for Default2DEpilogueProblem to improve performance

## Checklist

Use the following command to check performance:

make -j tile_layernorm2d_fwd && ./bin/tile_layernorm2d_fwd...