Building on linux with ppc64le CPU
I'm trying to build jax on a cluster that uses IBM power9 processors (it's a sister cluster to Summit at ORNL). It seems to be failing when trying to build XLA, which is strange because I've been able to install tensorflow just fine. The full output log is here: https://gist.github.com/f0uriest/5f04e2ed9916bb750a9ea679633ac80c
Any ideas? Is there any plan to offer pre-built wheels for the ppc64le architecture?
We don't support the PPC architecture ourselves and most likely don't have the engineering bandwidth to maintain such a build.
But we wouldn't object if the community wanted to support ppc64le. There are likely two pieces:
- making XLA work on PPC (probably mostly a matter of linking in the right LLVM backend in the TensorFlow/XLA tree)
- maintaining jaxlib builds.
Contributions welcome!
As to your specific question, I'd make sure that MKLDNN is disabled in the build (I believe there is an option to build.py for this.) I doubt MKLDNN works on non-Intel architectures.
I am working on the same issue. I managed to reach this point
https://gist.github.com/feifzhou/152d5c6e15e3485befa78e69cd340c32#file-gistfile1-txt
But got
- Warning about 404 ERROR while downloading a .gz file
- "failed: undeclared inclusion(s)" errors.
I tried both gcc 7.3.1 and 8.3.1 and got the same inclusion errors. Gcc 4.9.3 gave me lots of syntax errors. MKLDNN was disabled.
@feifzhou I don't know how to solve your issue, but it looks to me like Bazel isn't understanding something about the location of the standard library headers on your system. Do you have the same problem if you try to build TensorFlow? We share a lot of build infrastructure with them, so I'm wondering if this is JAX specific or a more general Bazel/TF problem.
(Ultimately we don't have cycles to work on this, but we welcome contributions!)
@f0uriest I've built v0.1.55 successfully on an IBM Power9, but more recent versions fail in the same way.
@mrorro Yeah I haven't been able to build any version since 0.1.55 either. It looks like at some point they switched some of the compiler flags to ones that are only defined for x86-64 architectures. Bazel supposedly lets you override these but I haven't gotten it to work yet.
@f0uriest If you can share the output of the build, we might be able to suggest things to change.
I'd speculate there are two or three things you'd need to do:
a) update build.py to pass the correct flags, if it isn't already doing so.
b) make sure XLA links in the Power LLVM backend if targeting Power. There are already cases for x86 and ARM; I don't recall if Power is included.
c) add a Power case to build_wheel.py.
@f0uriest @mrorro
Did you all happen to make progress with this issue? We are looking to build JAX on Summit and I happened upon this issue/discussion.
Nothing so far. Would love to see if you can solve it on Summit.
I also haven't made any progress, but I haven't had much time to work on it either. We've been using 0.1.55 for a while, though I'm hoping to upgrade later this summer.
@hawkinsp on Summit, after fixing compiler flags, we are also getting the download error:
WARNING: Download from http://mirror.tensorflow.org/github.com/tensorflow/runtime/archive/d29d1ef0a65a8f9c23e1f88067ce4205d3085e87.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
@asedova That is a benign warning, you can ignore it.
I was able to cross-compile a ppc64le wheel on an x86-64 machine by following the instructions in #7365. I can't easily test the resulting wheel, though.
I would imagine that building natively on a ppc64le machine requires nothing other than following the standard instructions once the changes in #7365 are merged.
Thanks
Thanks @hawkinsp we are eagerly awaiting this merge
One thing I'd like to double check: what does:
import platform
print(platform.machine())
print on your PPC machine?
And is it little endian?
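Both questions can be answered from the same Python session. This is a minimal sketch using only the standard library; `describe_host` is a hypothetical helper name used for illustration, not something in JAX.

```python
import platform
import sys


def describe_host():
    """Return the machine type and the byte order Python reports for this host."""
    # platform.machine() gives the architecture string.
    # sys.byteorder is either "little" or "big".
    return platform.machine(), sys.byteorder


machine, byteorder = describe_host()
print(machine)
print(byteorder)
```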
On my system I get
>>> import platform
>>> print(platform.machine())
ppc64le
It is little-endian
yes we are LE also
@f0uriest you guys are on Sierra?
On Lassen
@asedova Traverse at PPPL
#7365 is merged. Please try building jaxlib from git head. If it doesn't work, please post logs so I can try to debug it.
I could also share the cross-compiled wheel I made for Python 3.9; but I have no idea if it actually works. So it's probably best if you make sure it builds for you.
@hawkinsp Here is what I see on initial attempt: summit_jaxlib.log
Just for the record, I have tried various versions of GCC (6.4.0, 7.4.0, 8.1.1).
Notable errors:
gcc: error: unrecognized command line option '-std=c++14'
ERROR: /tmp/_bazel_rprout/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_google_protobuf/BUILD:155:11: Compiling src/google/protobuf/util/internal/datapiece.cc [for host] failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (cd /tmp/_baz
This second error seems to point at different source files on different runs.
@asedova do you see anything different?
You need a C++14 compiler to build JAX.
Something seems surprising here, though. gcc 6.1 and newer apparently support C++14: https://gcc.gnu.org/projects/cxx-status.html#cxx14
Note the documentation is quite clear that -std=c++14 is a flag gcc accepts! So this seems like something you need to figure out about your gcc installation.
@hawkinsp Apologies, I could have goofed that one actually. I thought I had GCC loaded...
Here is a run with GCC 7.4.0: summit_jaxlib-gcc7.4.0.log
@proutrc The issue here is that bazel hermeticity checking is upset that you appear to be reading header files outside what it considers to be the standard system paths.
I think your best fix here might be to write a small custom Bazel toolchain. As it happens, I show an example of how to do that in a comment in #7365. It's not that bad, you should be able to just modify my example. You would need to modify cxx_builtin_include_directories to include that header directory, and you'd need to change the other tool paths to point to the right places on your system.
In your case, you'd want to set host_crosstool_top to the same toolchain as crosstool_top.
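One way to find candidate entries for cxx_builtin_include_directories is to ask the compiler which directories it searches. The sketch below assumes gcc prints its search list on stderr between the "#include <...> search starts here:" and "End of search list." markers when run with -E -v. The helper names (`parse_include_dirs`, `builtin_include_dirs`, `with_realpaths`) are hypothetical and not part of JAX or Bazel. Emitting both the reported path and its symlink-resolved form is an assumption about what may help when a module system exposes the compiler through a symlinked prefix; it is not a confirmed fix.

```python
import os
import subprocess


def parse_include_dirs(gcc_stderr):
    """Extract the directories gcc lists between its search-list markers."""
    dirs = []
    in_list = False
    for line in gcc_stderr.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            in_list = False
        elif in_list:
            dirs.append(line.strip())
    return dirs


def builtin_include_dirs(compiler="gcc"):
    """Run the compiler on empty input and return its C++ header search path."""
    result = subprocess.run(
        [compiler, "-E", "-x", "c++", "-", "-v"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_include_dirs(result.stderr)


def with_realpaths(dirs):
    """Return each directory and its symlink-resolved form, without duplicates."""
    seen = []
    for d in dirs:
        for candidate in (os.path.normpath(d), os.path.realpath(d)):
            if candidate not in seen:
                seen.append(candidate)
    return seen
```

With the intended gcc first on PATH, `with_realpaths(builtin_include_dirs())` returns a list that can be compared against the paths named in an "undeclared inclusion(s)" error.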
@hawkinsp sorry for my ignorance, but is the mentioned toolchain directory at the top of the jax repo or in the build directory? My familiarity with bazel and its setup is limited, unfortunately. I am happy to work on this though, just want to make sure I am set up properly.
@proutrc In the example I gave, it's at the root of the JAX repository. (It doesn't matter a whole lot, so long as all the paths agree, and in my command line, etc. I refer to it as //toolchain, which is at the root of the repository.)
@hawkinsp
I seem to still run into similar issues. Is there anything else I am missing, besides an update to those paths for the tools and the cxx_builtin_include_directories? I am also putting the realpath in the cxx_builtin_include_directories list, but I see it has the non-realpath in the error output. Sometimes it does have the realpath though, oddly. I appreciate your help.
def _impl(ctx):
    return cc_common.create_cc_toolchain_config_info(
        ctx = ctx,
        features = features, # NEW
        cxx_builtin_include_directories = [
            "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/",
            "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/include/",
            "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/",
            "/usr/include/",
        ],
Error (it does seem to get further sometimes):
[0 / 31] [Prepa] Creating source manifest for //build:build_wheel ... (5 actions, 0 running)
[68 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 2s local ... (128 actions running)
[76 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 5s local ... (128 actions running)
[83 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 9s local ... (128 actions running)
[89 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 13s local ... (128 actions running)
[98 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 18s local ... (128 actions, 127 running)
ERROR: /gpfs/alpine/stf007/scratch/rprout/jax/jaxlib/BUILD:352:17: Compiling jaxlib/cpu_feature_guard.c failed: undeclared inclusion(s) in rule '//jaxlib:cpu_feature_guard.so':
this rule is missing dependency declarations for the following files included by 'jaxlib/cpu_feature_guard.c':
'/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/limits.h'
'/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/syslimits.h'
'/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stddef.h'
'/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdarg.h'
'/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdint.h'
Target //build:build_wheel failed to build
INFO: Elapsed time: 40.420s, Critical Path: 22.55s
INFO: 333 processes: 241 internal, 92 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
b''
Traceback (most recent call last):
File "build/build.py", line 604, in <module>
main()
File "build/build.py", line 599, in main
shell(command)
File "build/build.py", line 52, in shell
output = subprocess.check_output(cmd)
File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 376, in check_output
**kwargs).stdout
File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 468, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sw/.testing/belhorn/summit/bin/bazel', 'run', '--verbose_failures=true', '--host_crosstool_top=//toolchain:ppc', '--crosstool_top=//toolchain:ppc', '--config=short_logs', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/gpfs/alpine/stf007/scratch/rprout/jax/dist', '--cpu=ppc64le']' returned non-zero exit status 1.
@proutrc Try with --bazel_options=--cpu=ppc.