Building on linux with ppc64le CPU

Open f0uriest opened this issue 5 years ago • 70 comments

I'm trying to build jax on a cluster that uses IBM power9 processors (it's a sister cluster to Summit at ORNL). It seems to be failing when trying to build XLA, which is strange because I've been able to install tensorflow just fine. The full output log is here: https://gist.github.com/f0uriest/5f04e2ed9916bb750a9ea679633ac80c

Any ideas? Is there any plan to offer pre-built wheels for the ppc64le architecture?

f0uriest avatar Oct 08 '20 05:10 f0uriest

We don't support the PPC architecture ourselves and most likely don't have the engineering bandwidth to maintain such a build.

But we wouldn't object if the community wanted to support ppc64le. There are likely two pieces:

  • making XLA work on PPC (probably mostly a matter of linking in the right LLVM backend in the TensorFlow/XLA tree)
  • maintaining jaxlib builds.

Contributions welcome!

hawkinsp avatar Oct 08 '20 21:10 hawkinsp

As to your specific question, I'd make sure that MKLDNN is disabled in the build (I believe there is an option to build.py for this.) I doubt MKLDNN works on non-Intel architectures.

hawkinsp avatar Oct 08 '20 21:10 hawkinsp

I am working on the same issue. I managed to reach this point

https://gist.github.com/feifzhou/152d5c6e15e3485befa78e69cd340c32#file-gistfile1-txt

But got

  1. Warning about 404 ERROR while downloading a .gz file
  2. "failed: undeclared inclusion(s)" errors.

I tried both GCC 7.3.1 and 8.3.1 and got the same inclusion errors. GCC 4.9.3 gave me lots of syntax errors. MKLDNN was disabled.

feifzhou avatar Feb 02 '21 05:02 feifzhou

@feifzhou I don't know how to solve your issue, but it looks to me like Bazel isn't understanding something about the location of the standard library headers on your system. Do you have the same problem if you try to build TensorFlow? We share a lot of build infrastructure with them, so I'm wondering if this is JAX specific or a more general Bazel/TF problem.

(Ultimately we don't have cycles to work on this, but we welcome contributions!)

hawkinsp avatar Feb 02 '21 16:02 hawkinsp

@f0uriest I've built v0.1.55 successfully on an IBM POWER9, but more recent versions fail in the same way.

mrorro avatar Jun 09 '21 13:06 mrorro

@mrorro Yeah I haven't been able to build any version since 0.1.55 either. It looks like at some point they switched some of the compiler flags to ones that are only defined for x86-64 architectures. Bazel supposedly lets you override these but I haven't gotten it to work yet.

f0uriest avatar Jun 25 '21 16:06 f0uriest

@f0uriest If you can share the output of the build, we might be able to suggest things to change.

I'd speculate there are two or three things you'd need to do:

  a) update build.py to pass the correct flags, if it isn't already doing so.
  b) make sure XLA links in the Power LLVM backend if targeting Power. There are already cases for x86 and ARM; I don't recall if Power is included.
  c) add a Power case to build_wheel.py.

hawkinsp avatar Jun 26 '21 00:06 hawkinsp

@f0uriest @mrorro

Did you all happen to make progress with this issue? We are looking to build JAX on Summit and I happened upon this issue/discussion.

proutrc avatar Jul 23 '21 00:07 proutrc

Nothing so far. Would love to see if you can solve it on Summit.

feifzhou avatar Jul 23 '21 00:07 feifzhou

I haven't made any progress either, but I also haven't had much time to work on it. We've been using 0.1.55 for a while, though I'm hoping to upgrade later this summer.

f0uriest avatar Jul 23 '21 02:07 f0uriest

@hawkinsp on Summit, after fixing compiler flags, we are also getting the download error:

WARNING: Download from http://mirror.tensorflow.org/github.com/tensorflow/runtime/archive/d29d1ef0a65a8f9c23e1f88067ce4205d3085e87.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found

asedova avatar Jul 23 '21 13:07 asedova

@asedova That is a benign warning, you can ignore it.

hawkinsp avatar Jul 23 '21 14:07 hawkinsp

I was able to cross-compile a ppc64le wheel on an x86-64 machine by following the instructions in #7365. I can't easily test the resulting wheel, though.

I would imagine that building natively on a ppc64le machine requires nothing other than following the standard instructions once the changes in #7365 are merged.

hawkinsp avatar Jul 23 '21 14:07 hawkinsp

Thanks

asedova avatar Jul 23 '21 14:07 asedova

Thanks @hawkinsp, we are eagerly awaiting this merge.

asedova avatar Jul 23 '21 15:07 asedova

One thing I'd like to double check: what does:

import platform
print(platform.machine())

print on your PPC machine?

And is it little endian?
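
Both questions can be answered in one go with a small standard-library check. This is only a sketch; `describe_host` is a hypothetical helper name used for illustration, not part of JAX or its build scripts.

```python
import platform
import sys


def describe_host():
    """Return the machine type and byte order reported by this interpreter."""
    return {
        "machine": platform.machine(),  # e.g. "ppc64le" or "x86_64"
        "byteorder": sys.byteorder,     # "little" or "big"
    }


if __name__ == "__main__":
    info = describe_host()
    print(info["machine"], info["byteorder"])
```

`sys.byteorder` reports the byte order of the running interpreter, which is the property that matters for a natively built wheel.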

hawkinsp avatar Jul 23 '21 15:07 hawkinsp

On my system I get

>>> import platform
>>> print(platform.machine())
ppc64le

It is little-endian

f0uriest avatar Jul 23 '21 15:07 f0uriest

Yes, we are little-endian as well.

asedova avatar Jul 23 '21 15:07 asedova

@f0uriest you guys are on Sierra?

asedova avatar Jul 23 '21 15:07 asedova

On Lassen

feifzhou avatar Jul 23 '21 15:07 feifzhou

@asedova Traverse at PPPL

f0uriest avatar Jul 23 '21 15:07 f0uriest

#7365 is merged. Please try building jaxlib from git head. If it doesn't work, please post logs so I can try to debug it.

I could also share the cross-compiled wheel I made for Python 3.9; but I have no idea if it actually works. So it's probably best if you make sure it builds for you.

hawkinsp avatar Jul 23 '21 16:07 hawkinsp

@hawkinsp Here is what I see on the initial attempt: summit_jaxlib.log

Just for the record, I have tried various versions of GCC (6.4.0, 7.4.0, 8.1.1).

Notable errors:

gcc: error: unrecognized command line option '-std=c++14'

ERROR: /tmp/_bazel_rprout/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_google_protobuf/BUILD:155:11: Compiling src/google/protobuf/util/internal/datapiece.cc [for host] failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (cd /tmp/_baz

The second error seems to point at a different source file on each run.

@asedova do you see anything different?

proutrc avatar Jul 23 '21 17:07 proutrc

You need a C++14 compiler to build JAX.

Something seems surprising here, though. gcc 6.1 and newer apparently support C++14: https://gcc.gnu.org/projects/cxx-status.html#cxx14

Note the documentation is quite clear that -std=c++14 is a flag gcc accepts! So this seems like something you need to figure out about your gcc installation.
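
To see what the `gcc` on your PATH actually does with that flag, a small probe like the one below may help. It is a sketch, not part of the JAX build: `accepts_flag` is a hypothetical helper, and it simply asks whichever compiler `shutil.which` finds to syntax-check a trivial C++ program. One possible cause of a rejection (an assumption, not something the log confirms) is that `gcc` resolves to a different, older installation than the one you intended to use.

```python
import shutil
import subprocess


def accepts_flag(compiler, flag="-std=c++14"):
    """Return True if `compiler` syntax-checks a trivial C++ program with `flag`."""
    path = shutil.which(compiler)
    if path is None:
        return False  # the compiler is not on PATH at all
    try:
        result = subprocess.run(
            # "-x c++" forces C++ mode; "-" makes the compiler read from stdin.
            [path, flag, "-x", "c++", "-fsyntax-only", "-"],
            input="int main() { return 0; }\n",
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for name in ("gcc", "g++"):
        # Print the resolved path too, so a surprising installation is visible.
        print(name, shutil.which(name), accepts_flag(name))
```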

hawkinsp avatar Jul 23 '21 17:07 hawkinsp

@hawkinsp Apologies, I may have goofed on that one. I thought I had GCC loaded...

Here is a run with GCC 7.4.0: summit_jaxlib-gcc7.4.0.log

proutrc avatar Jul 23 '21 17:07 proutrc

@proutrc The issue here is that bazel hermeticity checking is upset that you appear to be reading header files outside what it considers to be the standard system paths.

I think your best fix here might be to write a small custom Bazel toolchain. As it happens, I show an example of how to do that in a comment in #7365. It's not that bad; you should be able to just modify my example. You would need to modify cxx_builtin_include_directories to include that header directory, and you'd need to change the other tool paths to point to the right places on your system.

In your case, you'd want to set host_crosstool_top to the same toolchain as crosstool_top.
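
To find candidate entries for cxx_builtin_include_directories, one option is to ask the compiler itself which directories it searches. The sketch below runs the compiler in verbose preprocessing mode (`-E -x c++ - -v`) and parses the search list it prints to stderr. The helper names `parse_include_dirs` and `query_include_dirs` are hypothetical, and whether the resulting list is sufficient for your toolchain is an assumption you would need to verify.

```python
import shutil
import subprocess


def parse_include_dirs(verbose_output):
    """Extract the '#include <...>' search directories from `-v` output."""
    dirs = []
    collecting = False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            collecting = True
        elif line.startswith("End of search list."):
            break
        elif collecting and line.strip():
            dirs.append(line.strip())
    return dirs


def query_include_dirs(compiler="gcc", language="c++"):
    """Ask `compiler` for its built-in include search path; [] if not found."""
    path = shutil.which(compiler)
    if path is None:
        return []
    result = subprocess.run(
        [path, "-E", "-x", language, "-", "-v"],
        input="",              # empty translation unit on stdin
        capture_output=True,
        text=True,
        timeout=60,
    )
    # The search list is written to stderr, not stdout.
    return parse_include_dirs(result.stderr)


if __name__ == "__main__":
    for directory in query_include_dirs("gcc"):
        print(directory)
```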

hawkinsp avatar Jul 23 '21 17:07 hawkinsp

@hawkinsp Sorry for my ignorance, but is the mentioned toolchain directory at the top of the jax repo or in the build directory? My familiarity with Bazel and its setup is limited, unfortunately. I am happy to work on this though; I just want to make sure I am set up properly.

proutrc avatar Jul 23 '21 17:07 proutrc

@proutrc In the example I gave, it's at the root of the JAX repository. (It doesn't matter a whole lot, so long as all the paths agree, and in my command line, etc. I refer to it as //toolchain, which is at the root of the repository.)

hawkinsp avatar Jul 23 '21 17:07 hawkinsp

@hawkinsp

I seem to still run into similar issues. Is there anything else I am missing, besides an update to those paths for the tools and the cxx_builtin_include_directories? I am also putting the realpath in the cxx_builtin_include_directories list, but I see it has the non-realpath in the error output. Sometimes it does have the realpath though, oddly. I appreciate your help.

def _impl(ctx):
   return cc_common.create_cc_toolchain_config_info(
       ctx = ctx,
       features = features, # NEW
       cxx_builtin_include_directories = [
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/include/",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/",
          "/usr/include/",
       ],

Error (it does seem to get further sometimes):

[0 / 31] [Prepa] Creating source manifest for //build:build_wheel ... (5 actions, 0 running)
[68 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 2s local ... (128 actions running)
[76 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 5s local ... (128 actions running)
[83 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 9s local ... (128 actions running)
[89 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 13s local ... (128 actions running)
[98 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 18s local ... (128 actions, 127 running)
ERROR: /gpfs/alpine/stf007/scratch/rprout/jax/jaxlib/BUILD:352:17: Compiling jaxlib/cpu_feature_guard.c failed: undeclared inclusion(s) in rule '//jaxlib:cpu_feature_guard.so':
this rule is missing dependency declarations for the following files included by 'jaxlib/cpu_feature_guard.c':
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/limits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/syslimits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stddef.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdarg.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdint.h'
Target //build:build_wheel failed to build
INFO: Elapsed time: 40.420s, Critical Path: 22.55s
INFO: 333 processes: 241 internal, 92 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
b''
Traceback (most recent call last):
  File "build/build.py", line 604, in <module>
    main()
  File "build/build.py", line 599, in main
    shell(command)
  File "build/build.py", line 52, in shell
    output = subprocess.check_output(cmd)
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sw/.testing/belhorn/summit/bin/bazel', 'run', '--verbose_failures=true', '--host_crosstool_top=//toolchain:ppc', '--crosstool_top=//toolchain:ppc', '--config=short_logs', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/gpfs/alpine/stf007/scratch/rprout/jax/dist', '--cpu=ppc64le']' returned non-zero exit status 1.
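
The error reports the headers under /sw/summit/..., while the list above uses the resolved /autofs/nccs-svm1_sw/summit/... form. One thing that may be worth trying, on the unverified assumption that Bazel matches these directories as literal path prefixes, is to list both spellings of every directory. A minimal sketch that generates such a list (`with_realpaths` is a hypothetical helper, not part of Bazel or JAX):

```python
import os


def with_realpaths(dirs):
    """Return each directory followed by its symlink-resolved form, no duplicates."""
    seen = set()
    result = []
    for d in dirs:
        # Drop trailing slashes so "/usr/include/" and "/usr/include" compare equal.
        d = d.rstrip("/") or "/"
        for candidate in (d, os.path.realpath(d)):
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result


if __name__ == "__main__":
    # Example input only; substitute the directories from your own toolchain.
    for entry in with_realpaths(["/usr/include/"]):
        print(entry)
```

The output can be pasted into cxx_builtin_include_directories in place of the hand-written list.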

proutrc avatar Jul 23 '21 19:07 proutrc

@proutrc Try with --bazel_options=--cpu=ppc.

hawkinsp avatar Jul 23 '21 20:07 hawkinsp