llama2.c
Use cblas for matrix multiplication
One potential optimization is to use a library such as OpenBLAS or Intel's MKL to perform the matrix multiplication in the matmul function.
#include <cblas.h>
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,): let BLAS do the matrix-vector product
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}
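With that change the plain CPU build only needs to link a BLAS library; one possible compile line (assuming the libopenblas package, which provides both cblas.h and -lopenblas on most Linux distributions):

gcc -O3 -o run run.c -lm -lopenblas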
The following OpenMP/MKL device offloading computes matmul() on a GPU supported by oneAPI.
Adjust run.c:
// add at the top of run.c
#include <mkl.h>
#include <mkl_omp_offload.h>

// in malloc_run_state(): allocate the activation buffers with mkl_calloc
// (64-byte aligned) instead of calloc
s->xb = mkl_calloc(p->dim, sizeof(float), 64);
s->xb2 = mkl_calloc(p->dim, sizeof(float), 64);
s->hb = mkl_calloc(p->hidden_dim, sizeof(float), 64);
s->hb2 = mkl_calloc(p->hidden_dim, sizeof(float), 64);
s->q = mkl_calloc(p->dim, sizeof(float), 64);
s->k = mkl_calloc(p->dim, sizeof(float), 64);
s->v = mkl_calloc(p->dim, sizeof(float), 64);

// in free_run_state(): release them with mkl_free instead of free
mkl_free(s->xb);
mkl_free(s->xb2);
mkl_free(s->hb);
mkl_free(s->hb2);
mkl_free(s->q);
mkl_free(s->k);
mkl_free(s->v);
// note: n and d now come first so they can size the VLA parameters,
// so call sites must be updated to the new argument order
void matmul(int n, int d, float xout[d], float x[n], float w[d * n]) {
    // map the inputs to the device for this call and run sgemv there via MKL dispatch
    #pragma omp target data map(to: x[0:n], w[0:d*n]) map(tofrom: xout[0:d])
    {
        #pragma omp dispatch
        cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
    }
}
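Because of the new argument order, every matmul() call in run.c needs the same reshuffle; for example, the query projection in the forward pass (shown here as it appears in llama2.c, assuming n_heads == n_kv_heads so wq is dim x dim per layer):

// before
matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
// after
matmul(dim, dim, s->q, s->xb, w->wq + l*dim*dim);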
Build with Intel oneAPI:
icx -Ofast -march=native -fiopenmp -fopenmp-targets=spir64 -qmkl -DMKL_ILP64 -fopenmp-version=51 -c run.c -o run.o
icx -Ofast -march=native -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL run.o -o run
Example run on an Arc A770:
LIBOMPTARGET_DEVICETYPE=GPU ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./run stories15M.bin
[...]
achieved tok/s: 14.000373
Throughput is limited by the memory I/O between host and GPU.
Some stats from LIBOMPTARGET_DEBUG=1:
Target LEVEL0 RTL --> -- AllocMax=1(MB), Capacity=4, PoolSizeMax=256(MB)
Target LEVEL0 RTL --> -- : NewAlloc Reuse Hit(%)
Target LEVEL0 RTL --> -- Bucket[ 2048]: 1 12260 99.99
Target LEVEL0 RTL --> -- Bucket[ 4096]: 1 3293 99.97
Target LEVEL0 RTL --> -- Bucket[ 131072]: 1 182 99.45
Target LEVEL0 RTL --> -- Bucket[ 524288]: 1 4391 99.98
Target LEVEL0 RTL --> -- Bucket[ 1048576]: 1 3293 99.97
[...]
Target LEVEL0 RTL --> Memory usage for device memory, device 0x0000000002cb3880
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 6753058816, 4419148416
Target LEVEL0 RTL --> -- Allocated: 6753058816, 5819271168
Target LEVEL0 RTL --> -- Freed : 6753058816, 5819271168
Target LEVEL0 RTL --> -- InUse : 0, 0
Target LEVEL0 RTL --> -- PeakUse : 43810816, 1054720
Target LEVEL0 RTL --> -- NumAllocs: 188, 23424
Loading the weights into VRAM once, instead of with every matmul() call, improves the token speed to around 120-160 tok/s.
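A minimal sketch of how that could look, using omp target enter data to map the large per-layer weight matrices once after the checkpoint is loaded (the helper name is made up; the sizes assume llama2.c's TransformerWeights layout with n_heads == n_kv_heads). Since OpenMP reference-counts mapped storage, the per-call map(to: w[0:d*n]) in matmul() then finds the weights already present on the device and skips the host-to-device copy:

// hypothetical helper, called once after the checkpoint is read; keeps the
// weight matrices resident on the device for the lifetime of the program
void map_weights_to_device(TransformerWeights* w, Config* p) {
    long L = p->n_layers, dim = p->dim, hid = p->hidden_dim;
    #pragma omp target enter data map(to: w->wq[0:L*dim*dim], w->wk[0:L*dim*dim])
    #pragma omp target enter data map(to: w->wv[0:L*dim*dim], w->wo[0:L*dim*dim])
    #pragma omp target enter data map(to: w->w1[0:L*hid*dim], w->w2[0:L*dim*hid])
    #pragma omp target enter data map(to: w->w3[0:L*hid*dim])
}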
Setting MKL_VERBOSE=2 has also been very helpful to confirm that the computation actually happens on the GPU.