How is GEMM implemented?

Open Arsmart1 opened this issue 3 years ago • 0 comments

I am wondering how GEMM is implemented. Is it something like this: CPU RAM stores the full matrices A and B. Suppose we have 2 GPUs. We send the blocks A(i, k) and B(k, j) to GPU0, iterate over all possible k while accumulating the partial products, and end up with the block C(i, j) on GPU0. GPU1 does the same for a different block of C, and then we concatenate the results? If it is more complicated than that, is there a reference paper you could point me to? Thank you!
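
To make the scheme I have in mind concrete, here is a minimal sketch in plain Python. It is only my assumption about how the work might be split, not how this library actually does it: `blocked_gemm`, `matmul_tile`, the tile size, and the round-robin assignment of C blocks to devices are all hypothetical, and the "GPUs" are just integer labels (everything runs on the CPU).

```python
def matmul_tile(A, B, rows, cols, ks):
    """Partial product A[rows, ks] @ B[ks, cols], returned as a nested list."""
    return [[sum(A[i][k] * B[k][j] for k in ks) for j in cols] for i in rows]


def blocked_gemm(A, B, tile=2, num_devices=2):
    """Compute C = A @ B block by block.

    Each block C(i, j) is assigned to one (simulated) device, which
    accumulates A(i, k) @ B(k, j) over every k block. Returns C and a
    dict mapping each block's top-left corner to its device label.
    """
    m, kdim, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    assignment = {}

    # Enumerate the output blocks by their top-left corner.
    blocks = [(i0, j0) for i0 in range(0, m, tile) for j0 in range(0, n, tile)]

    for idx, (i0, j0) in enumerate(blocks):
        # Hypothetical round-robin choice of device for this output block.
        assignment[(i0, j0)] = idx % num_devices
        rows = range(i0, min(i0 + tile, m))
        cols = range(j0, min(j0 + tile, n))

        # Iterate over all k blocks and accumulate into C(i, j).
        for k0 in range(0, kdim, tile):
            ks = range(k0, min(k0 + tile, kdim))
            part = matmul_tile(A, B, rows, cols, ks)
            for a, i in enumerate(rows):
                for b, j in enumerate(cols):
                    C[i][j] += part[a][b]

    # The blocks of C are disjoint, so "concatenating" is just writing
    # each block into its own region of the output.
    return C, assignment


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[1, 0], [0, 1], [1, 1]]
    C, assignment = blocked_gemm(A, B, tile=2, num_devices=2)
    print(C)
    print(assignment)
```

In this sketch each device owns whole blocks of C, so no reduction across devices is needed. I do not know whether the real implementation splits along k as well, overlaps transfers with compute, or schedules blocks differently.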

Sep 30 '22 11:09 Arsmart1