OpenBLAS Tuning for Apple chips

OpenBLAS can now be built for Apple chips using the port at https://github.com/iains/gcc-darwin-arm64. The build succeeds and seems to run fine, so it might be time to think about tuning for this microarchitecture. If anybody is interested in working on this issue, I can probably facilitate hardware access.

Sep 01 '20 22:09 Keno

There is no specific code for Apple's desktop-to-be processor yet. As far as the internet tells, the ISA profile is already present:

Instruction set A64 – ARMv8.4-A

There is no specific support for big.LITTLE configurations, Apple or otherwise.

If you find Apple's Accelerate framework outperforming OpenBLAS, describe the regression here, as usual. As hardware becomes available, more people will actually use it and report what they find unfair.

Sep 02 '20 09:09 brada4

Well, that is why I said microarchitectural tuning, not architectural tuning ;) Accelerate will obviously be a good point of comparison.

Sep 02 '20 12:09 Keno

Does sysctl -n machdep.cpu.brand_string (as search engines tell me) return anything usable for identification? (For #2804, I just made the code return ARMV8 by default.) If it does, we could start by giving it its own TARGET, which would allow assigning appropriate compiler options. The next trivial step could be to repurpose the ThunderX2T99 kernels (as they should be the most advanced we have right now) and see how they fare compared to generic ARMV8 or CortexA57.

Sep 02 '20 13:09 martin-frbg

sysctl machdep   
machdep.user_idle_level: 128
machdep.wake_abstime: 1258169647836
machdep.time_since_reset: 538131617
machdep.wake_conttime: 46263198095284
machdep.deferred_ipi_timeout: 64000
machdep.cpu.cores_per_package: 8
machdep.cpu.core_count: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8
machdep.cpu.brand_string: Apple processor
machdep.lck_mtx_adaptive_spin_mode: 1
machdep.virtual_address_size: 47

feature detection is in the hw sysctl though:

hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 17179869184
hw.activecpu: 8
hw.physicalcpu: 8
hw.physicalcpu_max: 8
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: 131287967
hw.cacheconfig: 8 1 1 0 0 0 0 0 0 0
hw.cachesize: 4008591360 131072 8388608 0 0 0 0 0 0 0
hw.pagesize: 16384
hw.pagesize32: 16384
hw.cachelinesize: 64
hw.l1icachesize: 131072
hw.l1dcachesize: 131072
hw.l2cachesize: 8388608
hw.tbfrequency: 24000000
hw.packages: 1
hw.osenvironment: 
hw.ephemeral_storage: 0
hw.use_recovery_securityd: 0
hw.use_kernelmanagerd: 1
hw.serialdebugmode: 0
hw.optional.floatingpoint: 1
hw.optional.watchpoint: 4
hw.optional.breakpoint: 6
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_2_fhm: 0
hw.optional.amx_version: 0
hw.optional.ucnormal_mem: 0
hw.optional.arm64: 1
hw.targettype: J273a
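
The identification question can be sketched in C against the values listed above. This is a minimal sketch, not the code from #2804 or #2816: every sketch_ name is hypothetical, the constants are assumed to match <mach/machine.h>, and the sysctlbyname call is only compiled on macOS (elsewhere the reader simply reports failure).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Assumed to mirror CPU_ARCH_ABI64 and CPU_TYPE_ARM from <mach/machine.h>. */
#define SKETCH_CPU_ARCH_ABI64 0x01000000u
#define SKETCH_CPU_TYPE_ARM   12u
/* Assumed to mirror CPUFAMILY_ARM_VORTEX_TEMPEST from <mach/machine.h>. */
#define SKETCH_CPUFAMILY_VORTEX_TEMPEST 0x07d34b9fu

/* Reads a 32-bit sysctl such as "hw.cputype" or "hw.cpufamily".
   Returns 0 on success, -1 on failure or on non-Apple systems. */
int sketch_read_u32(const char *name, uint32_t *out)
{
    assert(name != NULL && out != NULL);
#ifdef __APPLE__
    size_t len = sizeof(*out);
    if (sysctlbyname(name, out, &len, NULL, 0) != 0 || len != sizeof(*out))
        return -1;
    return 0;
#else
    (void)name;
    (void)out;
    return -1;
#endif
}

/* Returns 1 if a raw hw.cputype value is ARM with the 64-bit ABI flag set. */
int sketch_is_arm64(uint32_t cputype)
{
    return cputype == (SKETCH_CPU_TYPE_ARM | SKETCH_CPU_ARCH_ABI64);
}

/* Returns 1 if a raw hw.cpufamily value is the assumed Vortex/Tempest family. */
int sketch_is_vortex_family(uint32_t cpufamily)
{
    return cpufamily == SKETCH_CPUFAMILY_VORTEX_TEMPEST;
}
```

With the output above, hw.cputype 16777228 is 0x0100000C and hw.cpufamily 131287967 is 0x07d34b9f, so both checks would pass on this machine if the assumed constants are right. The family value looks like a more usable identifier than the generic brand string.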

Sep 02 '20 17:09 Keno

Thanks for the data, rough first draft is in #2816

Sep 02 '20 21:09 martin-frbg

Apple's M1 appears to offer Intel AMX-like capabilities accessed via an arm64 ISA extension. This extension is in use by Apple's own Accelerate framework. Prototype code for using these matrix intrinsics may be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f

Dec 28 '20 20:12 danielchalef

That's not prototype code, nor an intrinsic header of sorts; it is an early attempt to document an undocumented co-processor.

Dec 28 '20 21:12 brada4

Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property ? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available e.g. through benchmarking.

Dec 28 '20 21:12 martin-frbg

That's not prototype code, nor an intrinsic header of sorts; it is an early attempt to document an undocumented co-processor.

Ulp. You're right. I scanned the gist too quickly. That's not prototype code, rather a documentation effort.

Dec 28 '20 21:12 danielchalef

Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property ? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available e.g. through benchmarking.

Fair enough regarding the IP concern.

Dec 28 '20 21:12 danielchalef

It would help if anyone could roughly benchmark Rosetta 2 and tell what works best from the x86 world: https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment#3616843

Dec 29 '20 14:12 brada4

@brada4 as I understand it Rosetta is more like a runtime x86 emulation environment to make x86 binaries run at all - I do not see how this would provide any insight compared to benchmarking the existing ARMV8 (and potentially thunderx2) kernels and working from there.

Dec 29 '20 15:12 martin-frbg

That emulated x86 will be around for 3-5 years (looking at "smooth" ppc to x86 transition years ago)

Dec 29 '20 22:12 brada4

Hello everyone, I had some fun the past days benchmarking R and Python, lately also compiled with openblas. See threads here https://twitter.com/dngman/status/1342580260815200257?s=20 and https://twitter.com/fxcoudert/status/1342598509418176514?s=20

If there is anything that I can do to accelerate this effort e.g. with testing etc. please let me know.

Dec 30 '20 14:12 dengemann

One trivial change to try (if you can spare the time) would be to edit kernel/arm64/KERNEL.VORTEX so that it includes either KERNEL.NEOVERSEN1 or KERNEL.THUNDERX3T110 instead of the more generic KERNEL.ARMV8 . This is just a stab in the dark though - no guarantees that this will actually make OpenBLAS faster, just that it would then use more recent BLAS kernels for server-class cpus rather than a smallest common denominator capable of running on old phones.

Dec 30 '20 14:12 martin-frbg

@martin-frbg is there a quick way to run a standard benchmark in openblas? I see there's a benchmark directory but not much info on how to use that…

Dec 30 '20 15:12 fxcoudert

No decent framework currently - just a bunch of individual files either inherited from GotoBLAS or inspired by those. Run make in the benchmark directory, and then execute one of the generated *.goto files with optional arguments of initial dimension, final dimension, step size, e.g. dlinpack.goto 1000 10000 50 to get a simple printout of problem size vs. MFlops to feed into e.g. gnuplot. The scripts subdirectory contains similarly trivial scripts for Python, Octave and R.

Dec 30 '20 15:12 martin-frbg

I've run the benchmark suite on OpenBLAS (develop branch) compiled for:

arm64 / VORTEX with LLVM shipped with Big Sur
x86_64 using homebrew's gcc-10 toolchain and run using Rosetta. These tests were run twice in order to cache the translation.

For comparison, I've included results for veclib, where these benchmarks compiled cleanly. Many did not and some tests segfaulted.

You'll also note that many of the tests appear to have underruns. I've not yet had the opportunity to dig in to understand why this happened.

Results: https://github.com/danielchalef/openblas-benchmark-m1

Next up: Try the KERNEL.NEOVERSEN1 and KERNEL.THUNDERX3T110 replacement for ARMV8.

Dec 30 '20 15:12 danielchalef

Thanks - the underruns make me suspect that _POSIX_TIMERS (for clock_gettime presence) is not defined on Big Sur, which would make the benchmarks fall back to gettimeofday() with only millisecond resolution. For most but unfortunately not all of the benchmarks you can set the environment variable OPENBLAS_LOOPS to some "suitable" repeat value to get measurable execution times.
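
The repeat-count idea can be sketched as follows. This is an illustration under assumptions, not the actual benchmark source: the sketch_ functions are hypothetical, and only the OPENBLAS_LOOPS variable name comes from the thread. The point is that timing many calls together and dividing by the count keeps the measured interval well above the timer resolution.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses a repeat count; falls back to 1 for NULL, non-numeric,
   non-positive or implausibly large input. Trailing text after the
   leading digits is ignored. */
int sketch_parse_loops(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL)
        return 1;
    value = strtol(text, &end, 10);
    if (end == text || value < 1 || value > 1000000)
        return 1;
    return (int)value;
}

/* Reads the repeat count from the OPENBLAS_LOOPS environment variable. */
int sketch_loops_from_env(void)
{
    return sketch_parse_loops(getenv("OPENBLAS_LOOPS"));
}

/* Average time of one call when `loops` calls took `total_seconds`. */
double sketch_seconds_per_call(double total_seconds, int loops)
{
    assert(loops > 0);
    return total_seconds / (double)loops;
}
```

A benchmark would time `sketch_loops_from_env()` back-to-back calls of the routine and report `sketch_seconds_per_call(elapsed, loops)`.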

Dec 30 '20 16:12 martin-frbg

This version of benchmark/bench.h would probably work for OSX: bench.h.txt

Dec 30 '20 16:12 martin-frbg

This version of benchmark/bench.h would probably work for OSX: bench.h.txt

I get the following when making the tests with the modified bench.h:

./bench.h:82:21: error: expected parameter declarator
 mach_timebase_info(&info);
                    ^
./bench.h:82:21: error: expected ')'
./bench.h:82:20: note: to match this '('
 mach_timebase_info(&info);
                   ^
./bench.h:82:2: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
 mach_timebase_info(&info);
 ^
1 warning and 2 errors generated.
make: *** [sgemm.o] Error 1

Dec 30 '20 17:12 danielchalef

Strange, looks like it ignored the declaration of info as a mach_timebase_info_data_t on the preceding line - but this is only cobbled together from various sources on the internet, not even compile-tested as I do not have any Apple hardware here.

Dec 30 '20 17:12 martin-frbg

@martin-frbg you can't call mach_timebase_info() outside of an actual function

Dec 30 '20 17:12 fxcoudert

right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please ?
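
For illustration, a timer with the call moved inside the function might look like this. It is a rough sketch and not the attached bench.h: sketch_getsec is a hypothetical name, the Apple branch assumes mach_absolute_time() ticks scaled by numer/denom give nanoseconds, and the non-Apple branch exists only so the sketch builds elsewhere.

```c
#if !defined(__APPLE__) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime on strict compilers */
#endif
#include <assert.h>
#include <stdint.h>
#if defined(__APPLE__)
#include <mach/mach_time.h>
#else
#include <time.h>
#endif

/* Returns a monotonically increasing timestamp in seconds. */
double sketch_getsec(void)
{
#if defined(__APPLE__)
    /* The declaration may live at file scope, but the call below is a
       statement and therefore has to sit inside a function body. */
    static mach_timebase_info_data_t info;
    if (info.denom == 0)
        mach_timebase_info(&info);
    assert(info.denom != 0);
    uint64_t ticks = mach_absolute_time();
    /* ticks * numer / denom is nanoseconds; scale to seconds. */
    return (double)ticks * (double)info.numer / (double)info.denom * 1e-9;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}
```

Caching the timebase in a static variable avoids repeating the query on every call; the final 1e-9 factor is the step that turns nanoseconds into seconds.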

Dec 30 '20 17:12 martin-frbg

CLOCK_REALTIME should be available, though, with microsecond resolution:

$ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int main (void){
  struct timespec tp;
  int result;

  result = clock_getres(CLOCK_REALTIME, &tp);
  printf("result: %d\n", result);
  printf("tp.tv_sec: %lld\n", (long long) tp.tv_sec);
  printf("tp.tv_nsec: %lld\n", (long long) tp.tv_nsec);
}
$ clang a.c && ./a.out
result: 0
tp.tv_sec: 0
tp.tv_nsec: 1000

Why it doesn't define _POSIX_TIMERS is beyond me…

Dec 30 '20 17:12 fxcoudert

right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please ?

The tests compiled. However, the math now appears off: dgemm.goto

          SIZE                   Flops             Time
 M=   1, N=   1, K=   1 :        0.00 MFlops 15125.000000 sec
 M=   2, N=   2, K=   2 :        0.00 MFlops 458.333333 sec
 M=   3, N=   3, K=   3 :        0.00 MFlops 458.333333 sec
 M=   4, N=   4, K=   4 :        0.00 MFlops 458.333333 sec
 M=   5, N=   5, K=   5 :        0.00 MFlops 583.333333 sec

Dec 30 '20 17:12 danielchalef

Off by only 1e9 probably (reporting nanoseconds instead of seconds), though something else seems to affect the very first call.
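
The unit fix can be checked with a small worked sketch. It assumes the conventional 2*M*N*K flop count for a real GEMM and that the printed "sec" column above actually holds nanoseconds; the function names are hypothetical.

```c
#include <assert.h>

/* Converts a duration in nanoseconds to seconds. */
double sketch_ns_to_seconds(double nanoseconds)
{
    return nanoseconds / 1e9;
}

/* MFlops for one real GEMM of size M x N x K that took `seconds`,
   counting 2*M*N*K floating-point operations. */
double sketch_gemm_mflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    return (2.0 * (double)m * (double)n * (double)k) / (seconds * 1e6);
}
```

Under those assumptions, the M=N=K=2 row (458.333333 ns, 16 flops) works out to roughly 35 MFlops rather than 0.00.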

Dec 30 '20 18:12 martin-frbg

@martin-frbg Your suggestion to set OPENBLAS_LOOPS to a larger number works. I'll upload dgemm results later today.

Dec 30 '20 18:12 danielchalef

@fxcoudert from https://github.com/pocoproject/poco/issues/1453 apparently clock_gettime was added in OSX 10.12 but the presence of _POSIX_TIMERS may depend on the minimum SDK version setting at compile time (?) Anyway we'd probably want this to work with OSX < 10.12
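
A compile-time probe along these lines could show what a given SDK setting reports. This is a sketch with hypothetical names; it relies only on the POSIX convention that _POSIX_TIMERS is greater than zero when the timers option is supported, and treats an undefined, empty, zero or negative value as "not advertised".

```c
#include <assert.h>
#include <unistd.h>

/* Returns 1 if the headers advertise POSIX timers at compile time, else 0. */
int sketch_has_posix_timers(void)
{
#if defined(_POSIX_TIMERS) && (_POSIX_TIMERS + 0 > 0)
    return 1;
#else
    return 0;
#endif
}

/* Names the timer a benchmark would pick under the scheme discussed above. */
const char *sketch_timer_name(int has_timers)
{
    assert(has_timers == 0 || has_timers == 1);
    return has_timers ? "clock_gettime" : "gettimeofday";
}
```

Calling `sketch_timer_name(sketch_has_posix_timers())` under different minimum-SDK settings would show whether the macro really changes with them.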

Dec 30 '20 20:12 martin-frbg

dgemm results on a MacBook Pro M1. OpenBLAS compiled with Xcode / clang version 12.0.0. The test was run 10 times with the first run discarded. OPENBLAS_LOOPS was set to 20 in order to avoid the underflow discussed above.

OpenBLAS (with VORTEX/ ARMV8 kernel) vs Veclib

visualization

OpenBLAS VORTEX/ ARMV8 vs NEOVERSEN1 vs THUNDERX3T110 kernels (all on the M1):

A little difficult to see given the similarity in results and scale. See charts below for some interesting matrix dimension results.

visualization (2)

visualization (3)

tl;dr Veclib significantly outperforms OpenBLAS, likely because it uses native, hardware-based matrix multiplication acceleration. The NEOVERSEN1 kernel appears to offer better results on the M1 than the default ARMV8 kernel.

Dec 30 '20 23:12 danielchalef