composable_kernel icon indicating copy to clipboard operation
composable_kernel copied to clipboard

Slow performance about Gemm_add_add_layernorm

Open PeixuanZuo opened this issue 3 years ago • 0 comments

Use https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/client_example/03_gemm_layernorm and set b_only_run_first_kernel = false to run all instance.

There are two problems.

  1. normalize performance is very slow, slower than layernorm. I found an existing comment point to this issue.

Use M = 8192, N = 320, K = 320, the result of gemm_add_add_reduce_normalize:

/composable_kernel/client_example/build/03_gemm_layernorm# ./client_gemm_add_add_reduce_normalize 
found 13 gemm_reduceMean_reduceSquareMean instances
found 4 normalize instances
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 256, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 256, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {128, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:    0.04096 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0457441 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0382721 msDeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {128, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0492801 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 32, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 32, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0515681 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 32, 32, 8, 8>
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0516961 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 32, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0485601 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0615522 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0760002 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:   0.102016 ms

The result of layernorm

./ckProfiler layernorm 0 1 2 0 1 --length 8192 320
found 13 instances
...
length = 8192,320, stride = 320,1, reduce dims 1
best perf = 0.0288481 ms, 363.526 GB/s, DeviceNormalizationImpl<256,M_C8_S1,K_C32_S8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>
  1. gemm_reduce doesn't have great performance. the performance time is microseconds(us). and skiplayernorm in ort is equal to add_add_layernorm in CK
 -- CK -- -- -- ORT -- -- --
MNK gemm_bias_add_reduce normalize gemm layernorm gemm+reduce+normalize gemm skiplayernorm gemm+skiplayernorm
[8192,320,320] 41.1 45 26.2 29 86.1 25.2 18.37 43.57
[2048,640,640] 42 30 25.6 16 72 23.6 9 32.6
[512,1280,1280] 47 21 32.8 10 68 21.4 7.71 29.11

PeixuanZuo avatar Jan 12 '23 05:01 PeixuanZuo