Slow performance about Gemm_add_add_layernorm

Open PeixuanZuo opened this issue 3 years ago • 0 comments

Use https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/client_example/03_gemm_layernorm and set b_only_run_first_kernel = false to run all instance.

There are two problems.

normalize performance is very slow, slower than layernorm. I found an existing comment point to this issue.

Use M = 8192, N = 320, K = 320, the result of gemm_add_add_reduce_normalize:

/composable_kernel/client_example/build/03_gemm_layernorm# ./client_gemm_add_add_reduce_normalize 
found 13 gemm_reduceMean_reduceSquareMean instances
found 4 normalize instances
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 256, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 256, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 128, 32, 8, 8> does not support this problem
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {128, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:    0.04096 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0457441 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {320, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0382721 msDeviceGemmBiasAddReduce_Xdl_CShuffle<256, 128, 64, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<256, 64, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {640, 1, 1}, block_dim {128, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0492801 msDeviceGemmBiasAddReduce_Xdl_CShuffle<128, 128, 32, 32, 8, 8>
DeviceGemmBiasAddReduce_Xdl_CShuffle<128, 32, 128, 32, 8, 8> does not support this problem
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0515681 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 64, 32, 32, 8, 8>
launch_and_time_kernel: grid_dim {1280, 1, 1}, block_dim {64, 1, 1} 
Warm up 1 time
Start running 10 times...
Gemm + reduce Perf:  0.0516961 msDeviceGemmBiasAddReduce_Xdl_CShuffle<64, 32, 64, 32, 8, 8>
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0485601 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0615522 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:  0.0760002 ms
launch_and_time_kernel: grid_dim {120, 1, 1}, block_dim {256, 1, 1} 
Warm up 1 time
Start running 10 times...
Normalize Perf:   0.102016 ms

The result of layernorm

./ckProfiler layernorm 0 1 2 0 1 --length 8192 320
found 13 instances
...
length = 8192,320, stride = 320,1, reduce dims 1
best perf = 0.0288481 ms, 363.526 GB/s, DeviceNormalizationImpl<256,M_C8_S1,K_C32_S8,XYSrcVectorDim_1,VectorSize_X4_Gamma4_Beta4_Y4>

gemm_reduce doesn't have great performance. the performance time is microseconds(us). and skiplayernorm in ort is equal to add_add_layernorm in CK

--	CK	--	--	--	ORT	--	--	--
MNK	gemm_bias_add_reduce	normalize	gemm	layernorm	gemm+reduce+normalize	gemm	skiplayernorm	gemm+skiplayernorm
[8192,320,320]	41.1	45	26.2	29	86.1	25.2	18.37	43.57
[2048,640,640]	42	30	25.6	16	72	23.6	9	32.6
[512,1280,1280]	47	21	32.8	10	68	21.4	7.71	29.11

Jan 12 '23 05:01 PeixuanZuo