
OpenBLAS OpenMP support for AArch64 builds

Open nSircombe opened this issue 4 years ago • 25 comments

Currently, AArch64 builds rely on a single-threaded build of OpenBLAS, see: https://github.com/pytorch/builder/blob/8e799eb4708069db379dba20b1f324040f5e991e/build_aarch64_wheel.py#L182 Inclusion of OpenMP is marked as a TODO.

Enabling support should be a matter of adding the USE_OPENMP=1 flag at https://github.com/pytorch/builder/blob/8e799eb4708069db379dba20b1f324040f5e991e/build_aarch64_wheel.py#L182.

Note: When building PyTorch against OpenBLAS myself, I also explicitly set OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 when running setup.py.

Can OpenMP support be enabled in build_aarch64_wheel.py? Are there any issues currently blocking this change?
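For reference, the from-source build described above can be sketched as a shell invocation. This is illustrative only: /opt/OpenBLAS is an assumed install prefix and the exact flags are the ones quoted in the note above; adjust for your environment.

```shell
# Sketch of a from-source PyTorch build against an OpenMP-enabled OpenBLAS.
# /opt/OpenBLAS is an assumed install prefix; adjust to your setup.
export OpenBLAS_HOME=/opt/OpenBLAS
export BLAS=OpenBLAS
export USE_MKLDNN=0
export USE_OPENMP=1
export USE_LAPACK=1
python setup.py bdist_wheel   # run from a PyTorch source checkout
```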

Note: this mirrors an issue raised in https://github.com/pytorch/builder/issues/679, however it is unrelated to the issues concerning choices of mcpu, mtune, and march, so I felt it would be beneficial to separate it out and address any specific issues separately.

nSircombe avatar May 04 '21 12:05 nSircombe

@nSircombe last time I tried enabling OpenMP support in OpenBLAS and linking statically, I ran into a cannot-allocate-TLS runtime issue, but let me try again.

malfet avatar May 04 '21 18:05 malfet

Hi @malfet, That's interesting, I have not hit that issue myself - I'd be very interested to know how you get on. If we can reproduce then I'm happy to dig a little deeper.

nSircombe avatar May 04 '21 19:05 nSircombe

Hi @nSircombe, I was about to write that I could not reproduce the problem, but the reason was pretty simple: when I enabled OpenMP in OpenBLAS, PyTorch started to use Eigen instead. I use the following litmus test to check whether compilation with OpenBLAS was successful or not:

$ python -c "import torch;x=torch.rand(3,3);print(torch.__version__, torch.svd(torch.mm(x,x.t())))"
1.9.0.dev20210503 torch.return_types.svd(
U=tensor([[-0.5307, -0.0179, -0.8474],
        [-0.6476, -0.6365,  0.4190],
        [-0.5469,  0.7711,  0.3262]]),
S=tensor([1.7903, 0.2885, 0.0034]),
V=tensor([[-0.5307, -0.0179, -0.8474],
        [-0.6476, -0.6365,  0.4190],
        [-0.5469,  0.7711,  0.3262]]))

By the way, can you share a code snippet that triggers "Detect OpenMP Loop and this application may hang." warnings?

malfet avatar May 04 '21 23:05 malfet

Hi @malfet,

Your 'smoke test' works for me on my OpenMP-enabled from-source build. I notice that torch.__config__ for the released whl shows USE_EIGEN_FOR_BLAS=ON; this is not the case for my from-source builds.

can you share a code snippet that triggers "Detect OpenMP Loop and this application may hang."

Yep, I'll tidy it up and add something here.

nSircombe avatar May 10 '21 16:05 nSircombe

Here's a reproducer for the "Detect OpenMP Loop and this application may hang." warning.

import torch
import urllib.request
import torchvision.models as models
from PIL import Image
from torchvision import transforms

url, fname = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, fname)
input_img = Image.open(fname)
pp_img = transforms.Compose([
  transforms.Resize(256),
  transforms.CenterCrop(224),
  transforms.ToTensor(),
  transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225]),
])

input_ten = pp_img(input_img)
input_bat = input_ten.unsqueeze(0)

model = models.resnet50(pretrained=True)
model.eval()

out = model(input_bat)

Running with anything other than OMP_NUM_THREADS=1 with the 1.8.1 wheel produces the warning for me. Builds with an OpenMP-enabled OpenBLAS do not.

nSircombe avatar May 10 '21 21:05 nSircombe

@nSircombe what compiler is OpenBLAS being built with? Is it being built with -moutline-atomics? We certainly saw issues with OpenBLAS pthreads where ld/st exclusives were starving threads.

AGSaidi avatar May 13 '21 20:05 AGSaidi

I've built with GCC 9.3 and 10.2, I'm not using -moutline-atomics.

nSircombe avatar May 14 '21 07:05 nSircombe

Looks like the TLS problem was fixed in OpenBLAS 0.3.15, so I will try to enable compilation of an OpenMP-enabled OpenBLAS for 1.9.0; see https://github.com/pytorch/pytorch/pull/59428 and the appropriate builder change.

malfet avatar Jun 04 '21 13:06 malfet

Hi @malfet, How did you get on with the +OpenMP build following this change? Does the resulting build pick up OpenBLAS as expected (and not show USE_EIGEN_FOR_BLAS=ON in the config)?

nSircombe avatar Jun 07 '21 16:06 nSircombe

and appropriate builder change

...would you be able to point me towards the builder change?

nSircombe avatar Jun 08 '21 12:06 nSircombe

@nSircombe I've already made changes to builder (by just adding USE_OPENMP) and 1.9.2-rc2 builds should be available from https://download.pytorch.org/whl/test/cpu/torch_test.html

I would appreciate it if you could smoke-test them and let me know if they show the perf gains you expected from enabling OpenMP.

malfet avatar Jun 09 '21 14:06 malfet

Great, thanks @malfet, I'll take the new whl for a spin and let you know how I get on.

nSircombe avatar Jun 09 '21 14:06 nSircombe

Hi @malfet,

I've installed the whl from that URL (pip install torch -f https://download.pytorch.org/whl/test/cpu/torch_test.html), but it looks like the whl has still picked up Eigen. torch.__config__.show() lists USE_EIGEN_FOR_BLAS=ON, and a simple check does not show torch using all the available threads (https://forums.developer.nvidia.com/t/pytorch-and-numpy-only-run-with-one-core/110430); in my experience this happens when the linkage to OpenBLAS has quietly failed.
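As an aside, that config check can be scripted. Below is an illustrative helper (not from the thread; the sample string is made up, not real torch output) that scans a torch.__config__.show() dump for the Eigen flag:

```python
def uses_eigen_blas(config_text: str) -> bool:
    """Return True if a torch.__config__.show() dump indicates Eigen BLAS."""
    return "USE_EIGEN_FOR_BLAS=ON" in config_text

# Illustrative only; in practice you would pass torch.__config__.show() here.
sample = "USE_EIGEN_FOR_BLAS=ON\nUSE_OPENMP=ON\nUSE_LAPACK=1"
print(uses_eigen_blas(sample))  # True
```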

For my own builds from source, I'm currently using a build line like OpenBLAS_HOME=$OPENBLAS_DIR BLAS="OpenBLAS" USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 python setup.py install, and the result does not have USE_EIGEN_FOR_BLAS=ON (and, with BLAS="OpenBLAS" set explicitly, it fails if unable to link to OpenBLAS).

nSircombe avatar Jun 10 '21 07:06 nSircombe

Hi @nSircombe,

I'm not sure how you've measured CPU utilization in the example you referred to; I rely on the time tool, which seems to show that torch.bmm is indeed parallelized across cores:

$ uname -m; time python3 -c "import torch; print(torch.__version__, torch.get_num_threads());a=torch.rand(100, 100, 100); b=torch.rand(100,100, 100); [torch.bmm(a,b).sum() for i in range(1000)]"
aarch64
1.9.0 8

real	0m3.323s
user	0m23.233s
sys	0m0.168s

The user time is roughly 8x larger than the real time (and torch.get_num_threads() returned 8), which means torch.bmm was running on all available cores. And USE_EIGEN_FOR_BLAS=ON in the config is a red herring: if Eigen were used, torch.svd, which relies on LAPACK code from OpenBLAS, would not have worked.
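The user/real reasoning above can be captured in a tiny helper (illustrative, not part of the thread): total CPU time divided by wall-clock time approximates how many cores were kept busy. The figures below are the ones from the time output shown earlier.

```python
def effective_parallelism(user_s: float, real_s: float) -> float:
    """Approximate number of busy cores: total CPU time / wall-clock time."""
    return user_s / real_s

# real 0m3.323s, user 0m23.233s from the run above.
print(round(effective_parallelism(23.233, 3.323), 1))  # ~7.0, close to the 8 threads reported
```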

malfet avatar Jun 11 '21 19:06 malfet

Hi @malfet

torch.svd, which relies on LAPACK code from OpenBLAS would not have worked.

...that's a good point.

Not sure how you've measured CPU utilization...

I was just eye-balling htop while running some simple tests. When I look at the output from your simple example, I see it reporting the expected number of threads, but I only see 8 active in htop, and the timing is consistent with 8 threads. This holds even if I explicitly call torch.set_num_threads(32), for example:

> uname -m; time python3 -c "import torch; torch.set_num_threads(32); print(torch.__version__, torch.get_num_threads());a=torch.rand(100, 100, 100); b=torch.rand(100,100, 100); [torch.bmm(a,b).sum() for i in range(1000)]"     
aarch64
1.9.0 32

real    0m4.542s
user    0m34.504s
sys     0m3.137s

nSircombe avatar Jun 16 '21 14:06 nSircombe

Hi @malfet, I'm observing the same OpenMP thread-count issue that @nSircombe pointed out. I'm using https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html. I see the user process time scaling linearly with every thread up to ~8, after which it gets capped:

1.9.0 16

real	0m3.284s
user	0m26.842s
sys	0m0.213s

However, the nightly is reporting version 1.9.0; is this the right version?

Thank you!

snadampal avatar Jul 08 '21 17:07 snadampal

Hi @malfet, are there 1.10.x nightly wheels available for aarch64 Linux? I would like to check the scaling behavior on the latest builds, as the issue is not observed in local source builds.

Thank you!

snadampal avatar Jul 17 '21 14:07 snadampal

Hi, @malfet & @snadampal,

I've noticed there appear to be nightly builds of 1.10 for x86 and macosx_arm64, but no linux_aarch64 builds; is there a problem with this build at present? I was also hoping to look at the thread-scaling issue with the latest nightly. I've run build_aarch64_wheel.py, dispatching to a variety of instances (t4g, c6g) to see if this has any impact on the number of threads used by the final whl; however, the wheels generated currently fail when I try to run a simple test: ImportError: /home/ubuntu/python_venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: cannot allocate memory in static TLS block. Is this a known issue?

nSircombe avatar Jul 30 '21 07:07 nSircombe

Hi @nSircombe I have posted this PR to fix this CPU scaling issue on aarch64. https://github.com/pytorch/builder/pull/818

Root cause: while compiling the OpenBLAS binary, if we don't explicitly define the max number of threads, it will be limited to the number of cores on the build host. Snippet from OpenBLAS/Makefile.system:

ifndef NUM_THREADS
NUM_THREADS = $(NUM_CORES)
endif

So, if you are building the wheel on an 8-core instance, it will set NUM_THREADS=8, and the binary wheel will impose this limit on every instance irrespective of its core count. The fix is to explicitly pass NUM_THREADS=64 for OpenBLAS builds. Since this is only a maximum, there shouldn't be any performance issue on lower-core-count instances, as long as the application doesn't set the thread count higher than the machine's core count.
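The Makefile fallback above can be modeled in a few lines of Python (openblas_max_threads is a made-up name for illustration; it models the conditional only, not OpenBLAS itself):

```python
def openblas_max_threads(build_env: dict, build_host_cores: int) -> int:
    """Model OpenBLAS's Makefile.system default: without an explicit
    NUM_THREADS at build time, the thread cap baked into the binary
    falls back to the core count of the *build* machine."""
    return int(build_env.get("NUM_THREADS", build_host_cores))

# Wheel built on an 8-core instance without NUM_THREADS: capped at 8 everywhere.
print(openblas_max_threads({}, 8))                      # 8
# With the fix, NUM_THREADS=64 is passed at build time:
print(openblas_max_threads({"NUM_THREADS": "64"}, 8))   # 64
```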

Thanks!

snadampal avatar Jul 30 '21 14:07 snadampal

Ah! Good spot @snadampal. I'd tried builds restricting OMP_NUM_THREADS and saw no impact, but looking again, I'd missed that ProcessorCount returns the number of physical cores, so NUM_CORES would still have been set to all the available cores, not OMP_NUM_THREADS, which would explain a lot.

nSircombe avatar Jul 30 '21 15:07 nSircombe

@snadampal thank you very much for the PR; please sign https://code.facebook.com/cla and I'll merge it. @nSircombe, regarding cannot allocate memory in static TLS block: this is exactly the issue that made me hesitant to enable OpenMP in OpenBLAS, as it allocates quite a lot of TLS storage in https://github.com/xianyi/OpenBLAS/blob/1b6db3dbba672b4f8af935bd43a1ff6cff4d20b7/driver/level2/gemv_thread.c#L93

malfet avatar Jul 30 '21 15:07 malfet

this is exactly the issue why I was hesitant to enable OpenMP in OpenBLAS

Yes, I remember you mentioned it before. This is the first time I've encountered it, though; I hadn't seen it when building locally, only now, when trying to build 1.10 using the build_aarch64_wheel.py script on t4g and c6g instances.

nSircombe avatar Jul 30 '21 16:07 nSircombe

@nSircombe, @malfet, I hit the cannot allocate memory in static TLS block issue from libtorch_cpu.so today, even while building from source locally (not with the build script). I worked around this by preloading the library: export LD_PRELOAD=path/to/libtorch_cpu.so

btw, I have seen the TLS block memory error even without the OpenMP backend for OpenBLAS.

snadampal avatar Aug 15 '21 00:08 snadampal

@snadampal can you please share a few more details: OS, Python version, and the master commit SHA?

malfet avatar Aug 15 '21 00:08 malfet

@malfet, here are my setup details:

OS: Linux / Ubuntu 20.04
Python: 3.8
PyTorch commit: bd9fad25c2646707c2a0fe8601bbd362610d0d9d
OpenBLAS: 0.3.15

snadampal avatar Aug 15 '21 01:08 snadampal