PeixuanZuo issues

Results: 5 issues of PeixuanZuo

Peixuanzuo/add migraphx ci

**Description**: Add a MIGraphX CI pipeline that tests the build and runs the unit tests. This PR is based on #11492. Pipeline: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=651727&view=logs&j=d16d5fd4-8d86-5567-da58-c395dcb46727&t=d16d5fd4-8d86-5567-da58-c395dcb46727

[ROCm] add SkipLayerNorm vectorize Regular case

**Description**: Related PRs: https://github.com/microsoft/onnxruntime/pull/12803 https://github.com/microsoft/onnxruntime/pull/12816 https://github.com/microsoft/onnxruntime/pull/12817 Add the vectorized regular case for SkipLayerNorm. 1. When the hidden size is 1024, the SkipLayerNormTunable op can only use the regular case. **Motivation and Context** -...

[ADD] add tunable SkipLayerNorm for ROCm EP

**Description**: Related PRs: https://github.com/microsoft/onnxruntime/pull/12803 https://github.com/microsoft/onnxruntime/pull/12816 https://github.com/microsoft/onnxruntime/pull/12821 1. Add a tunable SkipLayerNorm for the ROCm EP. 2. Keep the original implementation when tuning is disabled. **Motivation and Context** - Why is this...

Allow fastgelu/skiplayernorm profile by pass args from commandline

**Description**: This allows us to quickly launch a microbenchmark session with, for example: `python skip_layer_norm_test.py 8 128 128 float32` Related PRs: https://github.com/microsoft/onnxruntime/pull/12803 https://github.com/microsoft/onnxruntime/pull/12816 https://github.com/microsoft/onnxruntime/pull/12817 https://github.com/microsoft/onnxruntime/pull/12821 Reference: https://github.com/microsoft/onnxruntime/pull/12991...
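As a rough illustration of how a test script could accept such positional arguments, here is a minimal sketch using `argparse`. The helper name `parse_profile_args` and the argument names `batch_size`, `seq_len`, `hidden_size`, and `dtype` are assumptions made for this example; the PR snippet only shows the values `8 128 128 float32`, not what each position means or how the script parses them.

```python
import argparse


def parse_profile_args(argv=None):
    """Parse positional microbenchmark arguments.

    Hypothetical helper: the argument names below are assumed for
    illustration and are not taken from the PR itself.
    """
    parser = argparse.ArgumentParser(description="Launch a microbenchmark session")
    parser.add_argument("batch_size", type=int)   # assumed meaning of the 1st value
    parser.add_argument("seq_len", type=int)      # assumed meaning of the 2nd value
    parser.add_argument("hidden_size", type=int)  # assumed meaning of the 3rd value
    parser.add_argument("dtype", choices=["float16", "float32"])
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Equivalent to running: python some_test.py 8 128 128 float32
args = parse_profile_args(["8", "128", "128", "float32"])
print(args.batch_size, args.seq_len, args.hidden_size, args.dtype)
```

Passing an explicit list to `parse_profile_args` keeps the sketch testable without touching `sys.argv`; a real script would typically call it with no arguments.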

Slow performance about Gemm_add_add_layernorm

Use https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/client_example/03_gemm_layernorm and set [b_only_run_first_kernel = false](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/client_example/03_gemm_layernorm/gemm_add_add_layernorm.cpp#L216) to run all instances. There are two problems. 1. Normalize performance is very slow, slower than layernorm. I found an existing [comment](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/include/ck/tensor_operation/gpu/device/device_elementwise_2d.hpp#L175)...