Tuning for Apple chips

Open Keno opened this issue 5 years ago • 38 comments

OpenBLAS can now be built for Apple chips using the port at https://github.com/iains/gcc-darwin-arm64. The build succeeds and seems to run fine, so it might be time to think about tuning for this microarchitecture. If anybody is interested in working on this issue, I can probably facilitate hardware access.

Keno avatar Sep 01 '20 22:09 Keno

There is no code specific to Apple's desktop-to-be processor over there. As far as the internet can tell, the ISA profile is already present:

Instruction set A64 – ARMv8.4-A

There is no specific support for big.LITTLE configurations, Apple or not.

If you find Apple's Accelerate framework outperforming OpenBLAS, describe the regression here, as usual. As hardware becomes available, more people will actually use it and report what they find unfair.

brada4 avatar Sep 02 '20 09:09 brada4

Well, that is why I said microarchitectural tuning, not architectural tuning ;) Accelerate will obviously be a good point of comparison.

Keno avatar Sep 02 '20 12:09 Keno

Does sysctl -n machdep.cpu.brand_string (as search engines tell me) return anything usable for identification? (For #2804, I just made the code return ARMV8 by default.) If it does, we could start by giving it its own TARGET (which would allow assigning appropriate compiler options). The next trivial step could be to repurpose the ThunderX2T99 kernels (as they should be the most advanced we have right now) and see how they fare compared to generic ARMV8 or CortexA57.

martin-frbg avatar Sep 02 '20 13:09 martin-frbg

sysctl machdep   
machdep.user_idle_level: 128
machdep.wake_abstime: 1258169647836
machdep.time_since_reset: 538131617
machdep.wake_conttime: 46263198095284
machdep.deferred_ipi_timeout: 64000
machdep.cpu.cores_per_package: 8
machdep.cpu.core_count: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8
machdep.cpu.brand_string: Apple processor
machdep.lck_mtx_adaptive_spin_mode: 1
machdep.virtual_address_size: 47

Feature detection is in the hw sysctl, though:

hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 17179869184
hw.activecpu: 8
hw.physicalcpu: 8
hw.physicalcpu_max: 8
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: 131287967
hw.cacheconfig: 8 1 1 0 0 0 0 0 0 0
hw.cachesize: 4008591360 131072 8388608 0 0 0 0 0 0 0
hw.pagesize: 16384
hw.pagesize32: 16384
hw.cachelinesize: 64
hw.l1icachesize: 131072
hw.l1dcachesize: 131072
hw.l2cachesize: 8388608
hw.tbfrequency: 24000000
hw.packages: 1
hw.osenvironment: 
hw.ephemeral_storage: 0
hw.use_recovery_securityd: 0
hw.use_kernelmanagerd: 1
hw.serialdebugmode: 0
hw.optional.floatingpoint: 1
hw.optional.watchpoint: 4
hw.optional.breakpoint: 6
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_2_fhm: 0
hw.optional.amx_version: 0
hw.optional.ucnormal_mem: 0
hw.optional.arm64: 1
hw.targettype: J273a

Keno avatar Sep 02 '20 17:09 Keno
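As a reading aid for the dump above: hw.cputype 16777228 is 0x0100000C, i.e. CPU_TYPE_ARM (12) combined with CPU_ARCH_ABI64, and hw.cpufamily 131287967 is 0x07d34b9f, which Apple's mach/machine.h lists as CPUFAMILY_ARM_VORTEX_TEMPEST. The sketch below shows one way such values could be turned into a build target. The function names and the mapping to "VORTEX" and "ARMV8" are illustrative assumptions, not OpenBLAS's actual detection code, and the constants are redefined under local names so the sketch also compiles on non-Apple systems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Local copies of values from <mach/machine.h>. */
#define SKETCH_CPU_ARCH_ABI64            0x01000000u
#define SKETCH_CPU_TYPE_ARM              12u
#define SKETCH_CPUFAMILY_VORTEX_TEMPEST  0x07d34b9fu

/* Returns 1 if an hw.cputype value denotes 64-bit ARM. */
int is_arm64_cputype(uint32_t cputype) {
    return cputype == (SKETCH_CPU_TYPE_ARM | SKETCH_CPU_ARCH_ABI64);
}

/* Hypothetical mapping from hw.cpufamily to a target name;
   anything unrecognised falls back to the generic target. */
const char *target_for_cpufamily(uint32_t cpufamily) {
    if (cpufamily == SKETCH_CPUFAMILY_VORTEX_TEMPEST) return "VORTEX";
    return "ARMV8";
}

/* Reads a 32-bit integer sysctl by name. Returns 0 on success and
   -1 on failure (always -1 on non-Apple systems). */
int read_sysctl_u32(const char *name, uint32_t *out) {
    assert(name != NULL && out != NULL);
#ifdef __APPLE__
    size_t len = sizeof(*out);
    return sysctlbyname(name, out, &len, NULL, 0) == 0 ? 0 : -1;
#else
    return -1;
#endif
}

/* Chooses a target at run time, defaulting to generic ARMV8. */
const char *detect_target(void) {
    uint32_t cputype = 0, family = 0;
    if (read_sysctl_u32("hw.cputype", &cputype) != 0) return "ARMV8";
    if (!is_arm64_cputype(cputype)) return "ARMV8";
    if (read_sysctl_u32("hw.cpufamily", &family) != 0) return "ARMV8";
    return target_for_cpufamily(family);
}
```

Keying on the numeric hw.cpufamily rather than on machdep.cpu.brand_string avoids depending on a string that, in this dump, only says "Apple processor".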

Thanks for the data, rough first draft is in #2816

martin-frbg avatar Sep 02 '20 21:09 martin-frbg

Apple's M1 appears to offer Intel AMX-like capabilities accessed via an arm64 ISA extension. This extension is in use by Apple's own Accelerate framework. Prototype code for using these matrix intrinsics may be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f

danielchalef avatar Dec 28 '20 20:12 danielchalef

That's not prototype code, nor an intrinsics header of sorts; it is an early attempt to document an undocumented co-processor.

brada4 avatar Dec 28 '20 21:12 brada4

Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property ? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available e.g. through benchmarking.

martin-frbg avatar Dec 28 '20 21:12 martin-frbg

That's not prototype code, nor an intrinsics header of sorts; it is an early attempt to document an undocumented co-processor.

Ulp. You're right. I scanned the gist too quickly. That's not prototype code, rather a documentation effort.

danielchalef avatar Dec 28 '20 21:12 danielchalef

Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property ? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available e.g. through benchmarking.

Fair enough regarding the IP concern.

danielchalef avatar Dec 28 '20 21:12 danielchalef

Could anyone roughly benchmark Rosetta 2 and report what works best from the x86 world? https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment#3616843

brada4 avatar Dec 29 '20 14:12 brada4

@brada4 as I understand it Rosetta is more like a runtime x86 emulation environment to make x86 binaries run at all - I do not see how this would provide any insight compared to benchmarking the existing ARMV8 (and potentially thunderx2) kernels and working from there.

martin-frbg avatar Dec 29 '20 15:12 martin-frbg

That emulated x86 will be around for 3-5 years (judging by the "smooth" PPC-to-x86 transition years ago).

brada4 avatar Dec 29 '20 22:12 brada4

Hello everyone, I had some fun the past days benchmarking R and Python, lately also compiled with openblas. See threads here https://twitter.com/dngman/status/1342580260815200257?s=20 and https://twitter.com/fxcoudert/status/1342598509418176514?s=20

If there is anything that I can do to accelerate this effort e.g. with testing etc. please let me know.

dengemann avatar Dec 30 '20 14:12 dengemann

One trivial change to try (if you can spare the time) would be to edit kernel/arm64/KERNEL.VORTEX so that it includes either KERNEL.NEOVERSEN1 or KERNEL.THUNDERX3T110 instead of the more generic KERNEL.ARMV8 . This is just a stab in the dark though - no guarantees that this will actually make OpenBLAS faster, just that it would then use more recent BLAS kernels for server-class cpus rather than a smallest common denominator capable of running on old phones.

martin-frbg avatar Dec 30 '20 14:12 martin-frbg

@martin-frbg is there a quick way to run a standard benchmark in openblas? I see there's a benchmark directory but not much info on how to use that…

fxcoudert avatar Dec 30 '20 15:12 fxcoudert

No decent framework currently, just a bunch of individual files either inherited from GotoBLAS or inspired by those. Run make in the benchmark directory, and then execute one of the generated *.goto files with optional arguments of initial dimension, final dimension and step size, e.g. dlinpack.goto 1000 10000 50, to get a simple printout of problem size vs. MFlops to feed into e.g. gnuplot. The scripts subdirectory contains similarly trivial scripts for Python, Octave and R.

martin-frbg avatar Dec 30 '20 15:12 martin-frbg
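To make the "problem size vs. MFlops" printout concrete, here is a minimal sketch of the arithmetic behind such a figure for GEMM, using the conventional count of 2*M*N*K floating-point operations. The function names are hypothetical, and the inclusive sweep from initial to final dimension is an assumption about how the three arguments are interpreted, not something taken from the benchmark sources.

```c
#include <assert.h>

/* Conventional operation count for an (M x K) by (K x N) product:
   one multiply and one add per inner-product term. */
double gemm_flops(int m, int n, int k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Converts an operation count and a wall time into MFlops. */
double mflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / (seconds * 1.0e6);
}

/* Number of problem sizes visited by a sweep from..to in steps of
   `step`, assuming both end points are included. */
int sweep_points(int from, int to, int step) {
    assert(step > 0 && to >= from);
    return (to - from) / step + 1;
}
```

For example, a 1000 x 1000 x 1000 product that takes 2 seconds works out to 1000 MFlops under this count.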

I've run the benchmark suite on OpenBLAS (develop branch) compiled for:

  • arm64 / VORTEX with LLVM shipped with Big Sur
  • x86_64 using homebrew's gcc-10 toolchain and run using Rosetta. These tests were run twice in order to cache the translation.

For comparison, I've included results for veclib, where these benchmarks compiled cleanly. Many did not and some tests segfaulted.

You'll also note that many of the tests appear to have underruns. I've not yet had the opportunity to dig in to understand why this happened.

Results: https://github.com/danielchalef/openblas-benchmark-m1

Next up: Try the KERNEL.NEOVERSEN1 and KERNEL.THUNDERX3T110 replacement for ARMV8.

danielchalef avatar Dec 30 '20 15:12 danielchalef

Thanks - the underruns make me suspect that _POSIX_TIMERS (for clock_gettime presence) is not defined on Big Sur, which would make the benchmarks fall back to gettimeofday(), which has at best microsecond resolution. For most but unfortunately not all of the benchmarks you can set the environment variable OPENBLAS_LOOPS to some "suitable" repeat value to get measurable execution times.

martin-frbg avatar Dec 30 '20 16:12 martin-frbg
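The reason a repeat count helps can be sketched as follows: a single call that finishes in less than one timer tick measures as zero, whereas timing many calls back to back and dividing by the count gives a usable average. This is a minimal illustration under stated assumptions, not the code in benchmark/bench.h; parse_loops, wall_seconds, time_per_call and busy_work are hypothetical names, and the way OPENBLAS_LOOPS is parsed here is a guess at reasonable behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>

/* Parses a repeat count; falls back to 1 if the string is missing
   or does not yield a positive integer. */
int parse_loops(const char *s) {
    int loops = s ? atoi(s) : 1;
    return loops > 0 ? loops : 1;
}

/* Repeat count taken from the OPENBLAS_LOOPS environment variable. */
int loops_from_env(void) {
    return parse_loops(getenv("OPENBLAS_LOOPS"));
}

/* Wall-clock time in seconds from gettimeofday(); struct timeval
   carries seconds plus microseconds. */
double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Average seconds per call of fn over `loops` repetitions. */
double time_per_call(void (*fn)(void *), void *arg, int loops) {
    assert(fn != NULL && loops > 0);
    double start = wall_seconds();
    for (int i = 0; i < loops; i++) fn(arg);
    return (wall_seconds() - start) / loops;
}

/* Stand-in workload: adds 1.0 to an accumulator 1000 times. */
void busy_work(void *arg) {
    volatile double *acc = arg;
    for (int i = 0; i < 1000; i++) *acc += 1.0;
}
```

A caller would pass the routine under test in place of busy_work and use loops_from_env() as the repeat count.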

This version of benchmark/bench.h would probably work for OSX: bench.h.txt

martin-frbg avatar Dec 30 '20 16:12 martin-frbg

This version of benchmark/bench.h would probably work for OSX: bench.h.txt

I get the following when making the tests with the modified bench.h:

./bench.h:82:21: error: expected parameter declarator
 mach_timebase_info(&info);
                    ^
./bench.h:82:21: error: expected ')'
./bench.h:82:20: note: to match this '('
 mach_timebase_info(&info);
                   ^
./bench.h:82:2: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
 mach_timebase_info(&info);
 ^
1 warning and 2 errors generated.
make: *** [sgemm.o] Error 1

danielchalef avatar Dec 30 '20 17:12 danielchalef

Strange, looks like it ignored the declaration of info as a mach_timebase_info_data_t on the preceding line - but this is only cobbled together from various sources on the internet, not even compile-tested as I do not have any Apple hardware here.

martin-frbg avatar Dec 30 '20 17:12 martin-frbg

@martin-frbg you can't call mach_timebase_info() outside of an actual function

fxcoudert avatar Dec 30 '20 17:12 fxcoudert

right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please ?

martin-frbg avatar Dec 30 '20 17:12 martin-frbg

CLOCK_REALTIME should be available, though, with microsecond resolution:

$ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int main (void){
  struct timespec tp;
  int result;

  result = clock_getres(CLOCK_REALTIME, &tp);
  printf("result: %d\n", result);
  printf("tp.tv_sec: %lld\n", (long long) tp.tv_sec);
  printf("tp.tv_nsec: %lld\n", (long long) tp.tv_nsec);
}
$ clang a.c && ./a.out
result: 0
tp.tv_sec: 0
tp.tv_nsec: 1000

Why it doesn't define _POSIX_TIMERS is beyond me…

fxcoudert avatar Dec 30 '20 17:12 fxcoudert

right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please ?

The tests compiled. However, the math now appears off: dgemm.goto

          SIZE                   Flops             Time
 M=   1, N=   1, K=   1 :        0.00 MFlops 15125.000000 sec
 M=   2, N=   2, K=   2 :        0.00 MFlops 458.333333 sec
 M=   3, N=   3, K=   3 :        0.00 MFlops 458.333333 sec
 M=   4, N=   4, K=   4 :        0.00 MFlops 458.333333 sec
 M=   5, N=   5, K=   5 :        0.00 MFlops 583.333333 sec

danielchalef avatar Dec 30 '20 17:12 danielchalef

Off by only 1e9 probably (reporting nanoseconds instead of seconds), though something else seems to affect the very first call.

martin-frbg avatar Dec 30 '20 18:12 martin-frbg
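The factor of 1e9 can be checked against the numbers above. mach_timebase_info() yields a numer/denom pair such that ticks * numer / denom is a duration in nanoseconds. If the ratio on this machine is 125/3 (an inference from hw.tbfrequency: 24000000, since 1e9 / 24e6 = 125/3, not something reported in the thread), then the printed 458.333333 is exactly 11 ticks expressed in nanoseconds and 583.333333 is 14 ticks. The sketch below separates the conversion from the Apple-only calls and keeps the mach_timebase_info() call inside a function body; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __APPLE__
#include <mach/mach_time.h>
#endif

/* Converts a tick count to seconds: ticks * numer / denom is the
   duration in nanoseconds, and 1e-9 scales that to seconds. */
double ticks_to_seconds(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    return (double)ticks * (double)numer / (double)denom * 1.0e-9;
}

#ifdef __APPLE__
/* Seconds since an arbitrary origin. The timebase is queried once,
   on first use, from inside this function. */
double mach_seconds(void) {
    static mach_timebase_info_data_t info; /* zero-initialised */
    if (info.denom == 0) mach_timebase_info(&info);
    return ticks_to_seconds(mach_absolute_time(), info.numer, info.denom);
}
#endif
```

Omitting the final 1e-9 factor would reproduce the symptom in the printout: elapsed times that look like hundreds of seconds and MFlops figures that round to 0.00.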

@martin-frbg Your suggestion to set OPENBLAS_LOOPS to a larger number works. I'll upload dgemm results later today.

danielchalef avatar Dec 30 '20 18:12 danielchalef

@fxcoudert from https://github.com/pocoproject/poco/issues/1453 apparently clock_gettime was added in OSX 10.12, but the presence of _POSIX_TIMERS may depend on the minimum SDK version setting at compile time (?). Anyway, we'd probably want this to work with OSX < 10.12.

martin-frbg avatar Dec 30 '20 20:12 martin-frbg

dgemm results on a MacBook Pro M1. OpenBLAS compiled with Xcode / clang version 12.0.0. The test was run 10 times with the first run discarded. OPENBLAS_LOOPS was set to 20 in order to avoid the underruns discussed above.

OpenBLAS (with VORTEX/ARMV8 kernel) vs Veclib

visualization

OpenBLAS VORTEX/ARMV8 vs NEOVERSEN1 vs THUNDERX3T110 kernels (all on the M1):

A little difficult to see given the similarity in results and scale. See charts below for some interesting matrix dimension results.

visualization (2)

visualization (3)

tl;dr: Veclib significantly outperforms OpenBLAS, likely because it uses native, hardware-based matrix multiplication acceleration. The NEOVERSEN1 kernel appears to offer better results for the M1 than the default ARMV8 kernel.

danielchalef avatar Dec 30 '20 23:12 danielchalef