Tensor Cores not performing as expected
I got the following results from the cudaTensorCoreGemm sample, comparing the high performance kernel (compute_gemm) with the simple_wmma_gemm kernel. The high performance kernel appears to be slower than the simple_wmma_gemm kernel, and I get the same result with the other Tensor Core samples.
Initializing...
GPU Device 0: "Ampere" with compute capability 8.6
M: 4096 (16 x 256)  N: 4096 (16 x 256)  K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 540.459900 ms
TFLOPS: 0.25

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6
M: 4096 (16 x 256)  N: 4096 (16 x 256)  K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using simple_wmma_gemm kernel
Time: 191.521790 ms
TFLOPS: 0.72
System: Windows 11
GPU: NVIDIA GeForce RTX 3070 Laptop GPU
CUDA: 11.6
I got the same results as you; waiting for a reply.
I got the same results too (same GPU: NVIDIA GeForce RTX 3070 Laptop GPU). Could you please check this? @mdoijade @Ru7w1k @AndyDick Thanks so much.
@Aikol I switched from the Debug configuration to the Release configuration and got a reasonable result. Hope it helps you too.
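For anyone building the sample outside Visual Studio, the difference between the two configurations comes down to the nvcc flags: Debug builds typically pass `-G` (device-side debug info), which disables most device code optimizations, while Release builds optimize normally. A rough sketch of the two invocations (file name and output names here are illustrative, not the sample's exact build commands):

```shell
# Debug-style build: -G embeds device debug info and turns off device
# code optimizations, which is what makes the benchmark numbers misleading.
nvcc -G -g -arch=sm_86 cudaTensorCoreGemm.cu -o gemm_debug

# Release-style build: optimized device code; this is the configuration
# the sample's performance numbers are meant to be measured in.
nvcc -O3 -arch=sm_86 cudaTensorCoreGemm.cu -o gemm_release
```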
@iseanwang I got the same result as you. The Release configuration far outperforms the Debug configuration, and in Release the high performance kernel is faster than the simple kernel. This should be related to the compiler optimizations that Debug builds disable (e.g. the -O2/-O3 optimization level). But I still don't understand why the high performance kernel and the simple kernel show opposite behavior in the Release and Debug configurations.
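One plausible explanation: the "high performance" compute_gemm kernel relies on a shared-memory tiling pipeline with many more instructions for the compiler to schedule, so unoptimized (-G) device code hurts it far more than the straight-line simple kernel. For reference, the simple kernel is essentially one WMMA accumulation loop per warp, along these lines (a hedged sketch in the spirit of simple_wmma_gemm, not the sample's exact code; names and layout assumptions are mine):

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B.
// Assumes A is row-major MxK and B is col-major KxN, both in half precision.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int M, int N, int K) {
    // This warp's tile coordinates in the output matrix.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // March along K in 16-wide steps; each mma_sync is one Tensor Core MMA.
    for (int k = 0; k < K; k += 16) {
        int aRow = warpM * 16;
        int bCol = warpN * 16;
        if (aRow < M && bCol < N) {
            wmma::load_matrix_sync(a_frag, a + aRow * K + k, K);
            wmma::load_matrix_sync(b_frag, b + bCol * K + k, K);
            wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        }
    }

    int cRow = warpM * 16;
    int cCol = warpN * 16;
    if (cRow < M && cCol < N)
        wmma::store_matrix_sync(c + cRow * N + cCol, acc_frag, N,
                                wmma::mem_row_major);
}
```

There is very little here for the optimizer to lose, whereas compute_gemm's double-buffered shared-memory staging depends heavily on register allocation and instruction scheduling, so it degrades disproportionately under -G.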