Tensor Cores not performing as expected
I got the following results from the cudaTensorCoreGemm sample, comparing the high performance kernel (compute_gemm) with the simple_wmma_gemm kernel. The high performance kernel appears to be slower than the simple_wmma_gemm kernel, and I get the same result with the other Tensor Core samples.
Initializing...
GPU Device 0: "Ampere" with compute capability 8.6
M: 4096 (16 x 256)  N: 4096 (16 x 256)  K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 540.459900 ms
TFLOPS: 0.25

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6
M: 4096 (16 x 256)  N: 4096 (16 x 256)  K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using simple_wmma_gemm kernel
Time: 191.521790 ms
TFLOPS: 0.72
System: Windows 11
GPU: NVIDIA GeForce RTX 3070 Laptop GPU
CUDA: 11.6
I got the same results as you; waiting for a reply.
I got the same results too (same GPU: NVIDIA GeForce RTX 3070 Laptop GPU). Could you please check this? @mdoijade @Ru7w1k @AndyDick Thanks so much.
@Aikol I switched from the Debug configuration to the Release configuration and got a reasonable result. Hope it helps you too.
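For anyone building the sample outside Visual Studio, the difference between the two configurations comes down to the nvcc flags: Debug builds typically pass `-G` (device-side debug info), which disables most device code optimizations, while Release builds optimize normally. A rough sketch of the two invocations (file name and output names here are illustrative, not the sample's exact build commands):

```shell
# Debug-style build: -G embeds device debug info and turns off device
# code optimizations, which is what makes the benchmark numbers misleading.
nvcc -G -g -arch=sm_86 cudaTensorCoreGemm.cu -o gemm_debug

# Release-style build: optimized device code; this is the configuration
# the sample's performance numbers are meant to be measured in.
nvcc -O3 -arch=sm_86 cudaTensorCoreGemm.cu -o gemm_release
```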
@iseanwang I got the same result as you. The Release configuration far outperforms the Debug configuration, and in Release the high performance kernel is faster than the simple kernel. This should be related to the compiler optimizations that Debug builds disable (e.g. the -O2/-O3 optimization level). But I still don't understand why the high performance kernel and the simple kernel show opposite behavior in the Release and Debug configurations.
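One plausible explanation: the "high performance" compute_gemm kernel relies on a shared-memory tiling pipeline with many more instructions for the compiler to schedule, so unoptimized (-G) device code hurts it far more than the straight-line simple kernel. For reference, the simple kernel is essentially one WMMA accumulation loop per warp, along these lines (a hedged sketch in the spirit of simple_wmma_gemm, not the sample's exact code; names and layout assumptions are mine):

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B.
// Assumes A is row-major MxK and B is col-major KxN, both in half precision.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int M, int N, int K) {
    // This warp's tile coordinates in the output matrix.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // March along K in 16-wide steps; each mma_sync is one Tensor Core MMA.
    for (int k = 0; k < K; k += 16) {
        int aRow = warpM * 16;
        int bCol = warpN * 16;
        if (aRow < M && bCol < N) {
            wmma::load_matrix_sync(a_frag, a + aRow * K + k, K);
            wmma::load_matrix_sync(b_frag, b + bCol * K + k, K);
            wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        }
    }

    int cRow = warpM * 16;
    int cCol = warpN * 16;
    if (cRow < M && cCol < N)
        wmma::store_matrix_sync(c + cRow * N + cCol, acc_frag, N,
                                wmma::mem_row_major);
}
```

There is very little here for the optimizer to lose, whereas compute_gemm's double-buffered shared-memory staging depends heavily on register allocation and instruction scheduling, so it degrades disproportionately under -G.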