BLASX
How is GEMM implemented?
I am wondering how GEMM is implemented. Is it like this: CPU RAM stores the full matrices A and B. Suppose we have 2 GPUs. We send the blocks A(i, k) and B(k, j) to GPU0, iterate over all possible k, and accumulate the products to get the block C(i, j) on GPU0. GPU1 does the same for a different (i, j). Then we concatenate the results? If it is more complicated than that, do you have a reference paper? Thank you!
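To make the scheme I have in mind concrete, here is a minimal pure-Python sketch. It is only my guess at the approach, not BLASX's actual implementation: the function name `tiled_gemm`, the round-robin assignment of output blocks, and the "device" ids are all hypothetical, and the GPUs are simulated by a plain loop on the CPU.

```python
def tiled_gemm(A, B, tile, num_devices=2):
    """Compute C = A * B block by block.

    Hypothetical sketch: each output block C(i, j) is assigned to one
    simulated device (round-robin), which then iterates over all k and
    accumulates A(i, k) * B(k, j) into that block.
    """
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(m)]   # full result, kept in host memory
    owner = {}                          # (i0, j0) -> simulated device id
    task = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, p, tile):
            # Assumption: output blocks are handed out round-robin.
            owner[(i0, j0)] = task % num_devices
            task += 1
            # The owning device walks over every k block for this C(i, j).
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = 0.0
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C, owner
```

Calling `tiled_gemm(A, B, tile=1)` on a 2x3 by 3x2 product gives four output blocks that alternate between device 0 and device 1, and "concatenating" is just each device writing its own block of C. What I cannot tell from this sketch is how the real library schedules blocks or reuses data already on a GPU, which is the part I am asking about.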