OpenBLAS OpenMP support for AArch64 builds.
Currently, AArch64 builds rely on a single-threaded build of OpenBLAS, see: https://github.com/pytorch/builder/blob/8e799eb4708069db379dba20b1f324040f5e991e/build_aarch64_wheel.py#L182
Inclusion of OpenMP is marked as a TODO.
Enabling support should be a matter of adding the USE_OPENMP=1 flag at https://github.com/pytorch/builder/blob/8e799eb4708069db379dba20b1f324040f5e991e/build_aarch64_wheel.py#L182.
Note: when building PyTorch against OpenBLAS myself, I also explicitly set OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 when running setup.py.
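For reference, the full build invocation looks something like this ('/opt/OpenBLAS' is just where my local OpenBLAS install lives, and 'install' is just an example setup.py target):
$ OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 python setup.py install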
Can OpenMP support be enabled in build_aarch64_wheel.py?
Are there any issues currently blocking this change?
Note: this mirrors an issue raised in https://github.com/pytorch/builder/issues/679; however, it is unrelated to the issues there concerning the choice of mcpu, mtune, and march, so I felt it would be beneficial to separate it out and address any specific issues separately.
@nSircombe the last time I tried enabling OpenMP support in OpenBLAS and linking statically, I ran into a "cannot allocate TLS" runtime issue, but let me try again.
Hi @malfet, That's interesting, I have not hit that issue myself - I'd be very interested to know how you get on. If we can reproduce it then I'm happy to dig a little deeper.
Hi @nSircombe, I was about to write that I could not reproduce the problem, but the reason was pretty simple: when I enabled OpenMP in OpenBLAS, PyTorch started to use Eigen instead. I use the following litmus test to check whether the compilation against OpenBLAS was successful or not:
$ python -c "import torch;x=torch.rand(3,3);print(torch.__version__, torch.svd(torch.mm(x,x.t())))"
1.9.0.dev20210503 torch.return_types.svd(
U=tensor([[-0.5307, -0.0179, -0.8474],
          [-0.6476, -0.6365,  0.4190],
          [-0.5469,  0.7711,  0.3262]]),
S=tensor([1.7903, 0.2885, 0.0034]),
V=tensor([[-0.5307, -0.0179, -0.8474],
          [-0.6476, -0.6365,  0.4190],
          [-0.5469,  0.7711,  0.3262]]))
By the way, can you share a code snippet that triggers the "Detect OpenMP Loop and this application may hang." warning?
Hi @malfet,
Your 'smoketest' works for me on my OpenMP-ed from-source build. I notice that torch.__config__ on the released whl shows USE_EIGEN_FOR_BLAS=ON - this is not the case for my from-source builds.
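For reference, I'm checking those config flags with a quick one-liner:
$ python -c "import torch; print(torch.__config__.show())"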
can you share a code snippet that triggers "Detect OpenMP Loop and this application may hang."
Yep, I'll tidy it up and add something here.
Here's a reproducer for the "Detect OpenMP Loop and this application may hang." warning.
import urllib.request

import torch
import torchvision.models as models
from PIL import Image
from torchvision import transforms

# Fetch a sample image from the pytorch/hub repo
url, fname = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, fname)
input_img = Image.open(fname)

# Standard ImageNet preprocessing
pp_img = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
input_ten = pp_img(input_img)
input_bat = input_ten.unsqueeze(0)  # add a batch dimension

# Single inference pass through ResNet-50
model = models.resnet50(pretrained=True)
model.eval()
out = model(input_bat)
Running with anything other than OMP_NUM_THREADS=1 with the 1.8.1 wheel produces the warning for me. Builds with an OpenMP-enabled OpenBLAS do not.
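For example (calling the script repro.py here, purely for illustration):
$ OMP_NUM_THREADS=1 python repro.py   # no warning for me
$ OMP_NUM_THREADS=4 python repro.py   # prints the warning with the 1.8.1 wheel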
@nSircombe what compiler is OpenBLAS being built with? Is it being built with -moutline-atomics? We certainly saw issues with OpenBLAS pthreads, with ld/st exclusives starving threads.
I've built with GCC 9.3 and 10.2, I'm not using -moutline-atomics.
Looks like the TLS problem was fixed in OpenBLAS 0.3.15, so I will try to enable compilation against an OpenMP-enabled OpenBLAS for 1.9.0, see https://github.com/pytorch/pytorch/pull/59428 and the appropriate builder change
Hi @malfet,
How did you get on with the +OpenMP build following this change?
Does the resulting build pick up OpenBLAS as expected (and not show USE_EIGEN_FOR_BLAS=ON in the config)?
and the appropriate builder change
...would you be able to point me towards the builder change?
@nSircombe I've already made the changes to builder (by just adding USE_OPENMP), and 1.9.2-rc2 builds should be available from https://download.pytorch.org/whl/test/cpu/torch_test.html
I would appreciate it if you could smoke-test them and let me know if they show the perf gains you expected from enabling OpenMP.
Great, thanks @malfet, I'll take the new whl for a spin and let you know how I get on.
Hi @malfet,
I've installed the whl from that URL (pip install torch -f https://download.pytorch.org/whl/test/cpu/torch_test.html), but it looks like the whl has still picked up Eigen. torch.__config__.show() lists USE_EIGEN_FOR_BLAS=ON, and a simple check does not show torch using all the available threads (https://forums.developer.nvidia.com/t/pytorch-and-numpy-only-run-with-one-core/110430) - in my experience this happens when the linkage to OpenBLAS has quietly failed.
For my own builds from source, I'm currently using a build line like OpenBLAS_HOME=$OPENBLAS_DIR BLAS="OpenBLAS" USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 python setup.py install, and the result does not have USE_EIGEN_FOR_BLAS=ON (and, with BLAS="OpenBLAS" set explicitly, the build fails if it is unable to link to OpenBLAS).
Hi @nSircombe,
Not sure how you've measured CPU utilization in the example you've referred to; I rely on the time tool, which seems to show that torch.bmm is indeed parallelized across cores:
$ uname -m; time python3 -c "import torch; print(torch.__version__, torch.get_num_threads());a=torch.rand(100, 100, 100); b=torch.rand(100,100, 100); [torch.bmm(a,b).sum() for i in range(1000)]"
aarch64
1.9.0 8
real 0m3.323s
user 0m23.233s
sys 0m0.168s
The user time is roughly 8x larger than the real time (and torch.get_num_threads() returned 8), which means torch.bmm was running on all available cores. And USE_EIGEN_FOR_BLAS=ON in the config is a red herring: if Eigen were actually used, torch.svd, which relies on LAPACK code from OpenBLAS, would not have worked.
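For a more direct check, here is a minimal in-process sketch of the same measurement (sizes and thread counts are arbitrary; wall-clock time should drop as the thread count rises if torch.bmm is parallelized):
import time
import torch

a = torch.rand(100, 100, 100)
b = torch.rand(100, 100, 100)
for n in (1, 2, 4, 8):
    torch.set_num_threads(n)
    start = time.perf_counter()
    for _ in range(1000):
        torch.bmm(a, b)
    # print the thread count and the elapsed wall-clock time
    print(n, "threads:", round(time.perf_counter() - start, 3), "s")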
Hi @malfet
torch.svd, which relies on LAPACK code from OpenBLAS would not have worked.
...that's a good point.
Not sure how you've measured CPU utilization...
I was just eye-balling htop while running some simple tests.
When I look at the output from your simple example, I see it reporting the current number of threads, but I only see 8 active in htop, and the timing is consistent with 8 threads, even if I explicitly call torch.set_num_threads(32), for example:
> uname -m; time python3 -c "import torch; torch.set_num_threads(32); print(torch.__version__, torch.get_num_threads());a=torch.rand(100, 100, 100); b=torch.rand(100,100, 100); [torch.bmm(a,b).sum() for i in range(1000)]"
aarch64
1.9.0 32
real 0m4.542s
user 0m34.504s
sys 0m3.137s
Hi @malfet, I'm observing the same issue that @nSircombe pointed out with the OpenMP thread count. I'm using https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html. I see the user process time scaling linearly with every thread up to ~8, and then it gets capped:
1.9.0 16
real 0m3.284s
user 0m26.842s
sys 0m0.213s
However, the nightly is pointing to 1.9.0 - is this the right version?
Thank you!
Hi @malfet, are there 1.10.x nightly wheels available for aarch64 Linux? I would like to check the scaling behavior on the latest builds, as the issue is not observed with local source builds.
Thank you!
Hi, @malfet & @snadampal,
I've noticed there appear to be nightly builds of 1.10 for x86 and macosx_arm64 - but no linux_aarch64 builds - is there a problem with this build at present? I was also hoping to look at the thread-scaling issue with the latest nightly.
I've run build_aarch64_wheel.py, dispatching to a variety of instances (t4g, c6g), to see if this has any impact on the number of threads used by the final whl; however, the generated wheels are currently failing when I try to run a simple test: ImportError: /home/ubuntu/python_venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: cannot allocate memory in static TLS block. Is this a known issue?
Hi @nSircombe I have posted this PR to fix this CPU scaling issue on aarch64. https://github.com/pytorch/builder/pull/818
Root cause: while compiling the OpenBLAS binary, if we don't explicitly define the max number of threads, it will be limited to the number of cores on the build host; snippet from OpenBLAS/Makefile.system:
ifndef NUM_THREADS
NUM_THREADS = $(NUM_CORES)
endif
So, if you are building the wheel on an 8-core instance, it will set NUM_THREADS=8, and the binary wheel will impose this limit on every instance irrespective of its core count. The fix is to explicitly pass NUM_THREADS=64 for OpenBLAS builds. Since this is only a max limit, there shouldn't be any performance issue on lower core-count instances, as long as the application doesn't set the thread count higher than the native machine's core count.
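For illustration, the OpenBLAS build invocation then looks something like this (a sketch; see the PR above for the exact flags used by the builder scripts):
$ make USE_OPENMP=1 NUM_THREADS=64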
Thanks!
Ah! Good spot @snadampal. I'd tried builds restricting OMP_NUM_THREADS and saw no impact, but looking again, I'd missed that ProcessorCount returns the number of physical cores, so NUM_CORES would still have been set to all the available cores, not OMP_NUM_THREADS, which would explain a lot.
@snadampal thank you very much for the PR, please sign https://code.facebook.com/cla and I'll merge your PR.
@nSircombe regarding cannot allocate memory in static TLS block - this is exactly the issue that made me hesitant to enable OpenMP in OpenBLAS, as it allocates quite a lot of TLS storage in
https://github.com/xianyi/OpenBLAS/blob/1b6db3dbba672b4f8af935bd43a1ff6cff4d20b7/driver/level2/gemv_thread.c#L93
this is exactly the issue that made me hesitant to enable OpenMP in OpenBLAS
Yes, I remember you mentioned it before.
This is the first time I've encountered it, though; I'd not seen it when building locally - only now, trying to build 1.10 using the build_aarch64_wheel.py script on t4g and c6g.
@nSircombe, @malfet ,
I hit the cannot allocate memory in static TLS block issue from libtorch_cpu.so today, even while building from source locally (not with the build script). I worked around it by preloading the library:
export LD_PRELOAD=path/to/libtorch_cpu.so
By the way, I have seen the TLS block memory error even without the OpenMP backend for OpenBLAS.
@snadampal can you please share a bit more detail about the OS, Python version, and master commit SHA?
@malfet, here are my setup details:
OS: Linux / Ubuntu 20.04
Python: 3.8
PyTorch commit: bd9fad25c2646707c2a0fe8601bbd362610d0d9d
OpenBLAS: 0.3.15