llama2.c
Use cblas for matrix multiplication
One potential optimization is to use a library such as OpenBLAS or Intel's MKL to perform the matrix multiplication in the matmul function.
#include <cblas.h>
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,): let BLAS do the matrix-vector product
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}
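With that change the plain CPU build only needs to link a BLAS library; one possible compile line (assuming the libopenblas package, which provides both cblas.h and -lopenblas on most Linux distributions):

gcc -O3 -o run run.c -lm -lopenblas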
The following OpenMP/MKL device offloading computes matmul() on a GPU supported by oneAPI.
Adjust run.c:
// add at the top of run.c
#include <mkl.h>
#include <mkl_omp_offload.h>

// in malloc_run_state(): allocate the activation buffers with mkl_calloc
// (64-byte aligned) instead of calloc
s->xb = mkl_calloc(p->dim, sizeof(float), 64);
s->xb2 = mkl_calloc(p->dim, sizeof(float), 64);
s->hb = mkl_calloc(p->hidden_dim, sizeof(float), 64);
s->hb2 = mkl_calloc(p->hidden_dim, sizeof(float), 64);
s->q = mkl_calloc(p->dim, sizeof(float), 64);
s->k = mkl_calloc(p->dim, sizeof(float), 64);
s->v = mkl_calloc(p->dim, sizeof(float), 64);

// in free_run_state(): release them with mkl_free instead of free
mkl_free(s->xb);
mkl_free(s->xb2);
mkl_free(s->hb);
mkl_free(s->hb2);
mkl_free(s->q);
mkl_free(s->k);
mkl_free(s->v);
// note: n and d now come first so they can size the VLA parameters,
// so call sites must be updated to the new argument order
void matmul(int n, int d, float xout[d], float x[n], float w[d * n]) {
    // map the inputs to the device for this call and run sgemv there via MKL dispatch
    #pragma omp target data map(to: x[0:n], w[0:d*n]) map(tofrom: xout[0:d])
    {
        #pragma omp dispatch
        cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
    }
}
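Because of the new argument order, every matmul() call in run.c needs the same reshuffle; for example, the query projection in the forward pass (shown here as it appears in llama2.c, assuming n_heads == n_kv_heads so wq is dim x dim per layer):

// before
matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
// after
matmul(dim, dim, s->q, s->xb, w->wq + l*dim*dim);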
Build with Intel oneAPI:
icx -Ofast -march=native -fiopenmp -fopenmp-targets=spir64 -qmkl -DMKL_ILP64 -fopenmp-version=51 -c run.c -o run.o
icx -Ofast -march=native -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL run.o -o run
Example run on an Arc A770:
LIBOMPTARGET_DEVICETYPE=GPU ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./run stories15M.bin
[...]
achieved tok/s: 14.000373
Throughput is limited by the memory I/O between host and GPU.
Some stats from LIBOMPTARGET_DEBUG=1:
Target LEVEL0 RTL --> -- AllocMax=1(MB), Capacity=4, PoolSizeMax=256(MB)
Target LEVEL0 RTL --> -- : NewAlloc Reuse Hit(%)
Target LEVEL0 RTL --> -- Bucket[ 2048]: 1 12260 99.99
Target LEVEL0 RTL --> -- Bucket[ 4096]: 1 3293 99.97
Target LEVEL0 RTL --> -- Bucket[ 131072]: 1 182 99.45
Target LEVEL0 RTL --> -- Bucket[ 524288]: 1 4391 99.98
Target LEVEL0 RTL --> -- Bucket[ 1048576]: 1 3293 99.97
[...]
Target LEVEL0 RTL --> Memory usage for device memory, device 0x0000000002cb3880
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 6753058816, 4419148416
Target LEVEL0 RTL --> -- Allocated: 6753058816, 5819271168
Target LEVEL0 RTL --> -- Freed : 6753058816, 5819271168
Target LEVEL0 RTL --> -- InUse : 0, 0
Target LEVEL0 RTL --> -- PeakUse : 43810816, 1054720
Target LEVEL0 RTL --> -- NumAllocs: 188, 23424
Loading the weights into VRAM once, instead of with every matmul() call, improves the token speed to around 120-160 tok/s.
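A minimal sketch of how that could look, using omp target enter data to map the large per-layer weight matrices once after the checkpoint is loaded (the helper name is made up; the sizes assume llama2.c's TransformerWeights layout with n_heads == n_kv_heads). Since OpenMP reference-counts mapped storage, the per-call map(to: w[0:d*n]) in matmul() then finds the weights already present on the device and skips the host-to-device copy:

// hypothetical helper, called once after the checkpoint is read; keeps the
// weight matrices resident on the device for the lifetime of the program
void map_weights_to_device(TransformerWeights* w, Config* p) {
    long L = p->n_layers, dim = p->dim, hid = p->hidden_dim;
    #pragma omp target enter data map(to: w->wq[0:L*dim*dim], w->wk[0:L*dim*dim])
    #pragma omp target enter data map(to: w->wv[0:L*dim*dim], w->wo[0:L*dim*dim])
    #pragma omp target enter data map(to: w->w1[0:L*hid*dim], w->w2[0:L*dim*hid])
    #pragma omp target enter data map(to: w->w3[0:L*hid*dim])
}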
Setting MKL_VERBOSE=2 has also been very helpful to confirm that the computation actually happens on the GPU.