composable_kernel
composable_kernel copied to clipboard
Slow performance about Gemm_add_add_layernorm
Use https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/client_example/03_gemm_layernorm and set b_only_run_first_kernel = false to run all instance.
There are two problems.
- normalize performance is very slow, slower than layernorm. I found an existing comment point to this issue.
Use M = 8192, N = 320, K = 320, the result of gemm_add_add_reduce_normalize:
/composable_kernel/client_example/build/03_gemm_layernorm# ./client_gemm_add_add_reduce_normalize
found 13 gemm_reduceMean_reduceSquareMean instances
found 4 normalize instances
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 256, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 256, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {128, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.04096 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {64, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.0457441 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.0382721 msDeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {128, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.0492801 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 32, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 32, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.0515681 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 32, 32, 8, 8>
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1}
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf: 0.0516961 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 32, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Normalize Perf: 0.0485601 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Normalize Perf: 0.0615522 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Normalize Perf: 0.0760002 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Normalize Perf: 0.102016 ms
The result of layernorm
./ckProfiler layernorm 0 1 2 0 1 --length 8192 320
found 13 instances
...
length = 8192,320, stride = 320,1, reduce dims 1
best perf = 0.0288481 ms, 363.526 GB/s, DeviceNormalizationImpl<256,M_C8_S1,K_C32_S8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
- gemm_reduce doesn't have great performance. the performance time is microseconds(us). and skiplayernorm in ort is equal to add_add_layernorm in CK
| -- | CK | -- | -- | -- | ORT | -- | -- | -- |
|---|---|---|---|---|---|---|---|---|
| MNK | gemm_bias_add_reduce | normalize | gemm | layernorm | gemm+reduce+normalize | gemm | skiplayernorm | gemm+skiplayernorm |
| [8192,320,320] | 41.1 | 45 | 26.2 | 29 | 86.1 | 25.2 | 18.37 | 43.57 |
| [2048,640,640] | 42 | 30 | 25.6 | 16 | 72 | 23.6 | 9 | 32.6 |
| [512,1280,1280] | 47 | 21 | 32.8 | 10 | 68 | 21.4 | 7.71 | 29.11 |