easybuild
Fujitsu toolchain TODO list
Following up from https://github.com/easybuilders/easybuild/issues/701
update June 2021: hoping for access to Deucalion's A64FX partition when ready, to see if the current draft implementation is sufficiently generic or not
update June 2024: access to Deucalion's A64FX partition confirms that the current draft implementation is not sufficiently generic
Framework
- [ ] since we can't rely on the permanence and uniformity of the Fujitsu language environment modules, shall we use an environment variable to pass the location to the compiler? use a metadata file? use `which` to find it? (draft at https://github.com/migueldiascosta/easybuild-framework/commit/b21987eac273cfd60c5084d20a797916f4dd0f18)
- [x] HierarchicalMNS support: there isn't an MPI module, it's included in FCC (but we could create a fake module to keep the same levels...)
- [ ] add `etc/fujitsu_external_modules_metadata.cfg`?
  - for now, it would be simply `[lang/tcsds-1.2.31]\nname = lang\nversion = tcsds-1.2.31\nprefix = FJSVXTCLANGA`
- [ ] `-SSL2*` and `-SCALAPACK` should be used only when linking, but EasyBuild prepends `-L` to `LDFLAGS` variables, so they are currently in the compile flags (not a problem, but it generates warnings that pollute the logs) (https://github.com/easybuilders/easybuild-framework/issues/3700)
  - [ ] also, right now `-SSL2*` flags are being duplicated, probably being set by both `_set_blas_variables` and `_set_lapack_variables`, should be easy to fix
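
The two flag issues above (link-only flags ending up among the compile flags, and duplicated `-SSL2*` entries) could be handled along these lines. This is a minimal sketch, not the framework's actual code: `split_link_only_flags`, `dedupe_flags` and `LINK_ONLY_PREFIXES` are hypothetical names, and treating every flag that starts with `-SSL2` or `-SCALAPACK` as link-only is an assumption taken from the item above.

```python
# Hypothetical helpers, not part of easybuild-framework.
LINK_ONLY_PREFIXES = ("-SSL2", "-SCALAPACK")  # assumption: link-time only


def dedupe_flags(flags):
    """Drop repeated flags while keeping first-seen order."""
    seen = set()
    unique = []
    for flag in flags:
        if flag not in seen:
            seen.add(flag)
            unique.append(flag)
    return unique


def split_link_only_flags(flags):
    """Return (compile_flags, link_only_flags) from a mixed list of flags."""
    compile_flags, link_flags = [], []
    for flag in flags:
        if flag.startswith(LINK_ONLY_PREFIXES):
            link_flags.append(flag)
        else:
            compile_flags.append(flag)
    return dedupe_flags(compile_flags), dedupe_flags(link_flags)


mixed = ["-O2", "-SSL2BLAMP", "-SSL2BLAMP", "-SCALAPACK", "-O2"]
compile_flags, link_flags = split_link_only_flags(mixed)
print(compile_flags)
print(link_flags)
```

Deduplicating this way is only safe for standalone flags; flags that take a separate argument (where order and repetition matter) would need different handling.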
Easyblocks
- [ ] new easyblock for fcc in order to detect large page support and install compiler wrappers, PR upcoming (draft at https://github.com/migueldiascosta/easybuild-easyblocks/commit/642ff37fdfecfd7784e485ed4e3c2eeed6534b22)
Easyconfigs
- [ ] revert to `21.05` in `FCC` and `ffmpi` easyconfigs instead of `4.5.0`, since it seems we won't be able to pin the compiler version?
- [x] binutils, Perl: use osdep for zlib to avoid warnings from Fugaku's large page allocation feature, PR upcoming
- [x] HDF5: PR upcoming
  - [x] ``hidden symbol `__fixunstfsi' in /usr/lib/gcc/aarch64-redhat-linux/8/libgcc.a(fixunstfsi.o) is referenced by DSO``
    - we had added `--rtlib=compiler-rt` to `$LDFLAGS` in `M4` because of https://bugs.llvm.org/show_bug.cgi?id=16404, we need to do the same here (this will likely pop up again...)
    - actually `--rtlib=compiler-rt -lgcc_s`, because we still need other symbols from libgcc (e.g. `unwind`)
- [x] CMake: PR upcoming
  - [x] linker flags and FindMPI patches
  - [ ] installing with RPATH fails, apparently related to the static library libstdc++fs.a
- [x] LLVM: PR upcoming
  - [x] using `-bare` python dependency to avoid Rust, which itself builds LLVM
  - [x] `toolchainopts = {'cstd': 'gnu++11'}`: `fcc` only accepts `gnuXX`, `FCC` only accepts `gnu++XX`, need to parse `CFLAGS` accordingly in the framework toolchain definitions: https://github.com/easybuilders/easybuild-framework/pull/3731
- [x] Python: PR upcoming
  - [x] Unzip Makefile has hardcoded `CC=cc`, needs `CC="$CC"` buildopt (added to all other UnZip easyconfigs in https://github.com/easybuilders/easybuild-easyconfigs/pull/12887)
  - [x] Rust: fails with `thread 'main' panicked at 'couldn't find required command: "far"', src/bootstrap/sanity.rs:60:13`; problem seems to be when "finding compilers"
    - [x] `cc_detect` tries to infer the `ar` command name from `fcc` and comes up with `far`, but only if the `AR` environment variable is not set, so setting it in `prebuildopts`
    - [x] fails much later with `clang-7: error: unable to execute command: Killed; clang-7: error: clang frontend command failed due to signal...`
      - this happens when Rust is building its own LLVM (which is not honouring EB's `parallel`, needs ~~`prebuildopts += "export LLVM_PARALLEL_COMPILE_JOBS=%(parallel)s && "`~~ Ninja, which can be added as builddep if it is modified to use python `-bare` as builddep)
    - [x] make Rust use EB LLVM 12, after moving LLVM's python builddep to `-bare`
  - Python sometimes (?) fails building `cryptography` with `error: cargo failed with code: -11`
    - in this particular version one can use `CRYPTOGRAPHY_DONT_BUILD_RUST=1` if necessary...
- [x] SciPy-bundle
  - [x] numpy
    - compiler detection fails because of warning message about Fugaku's large page allocation support
      - this goes back to using zlib shared libraries, which requires PIC. But the warning doesn't show up when using the OS zlib, filtered out?
      - using zlib from the OS
    - [x] numpy 1.20.3 already has a 'fujitsu' fcompiler defined: https://github.com/numpy/numpy/pull/17792
    - [x] add patch to filter executables
    - [x] update `fortranpythonpackage` easyblock to pass `--fcompiler=fujitsu`: https://github.com/easybuilders/easybuild-easyblocks/pull/2434
    - [x] patch lapack and blas detection to support SSL2
    - [x] patch f2py tests to use `--fcompiler`
    - [x] fatal error in test_cffi, extending using cffi leads to fatal error that crashes the test
  - [x] scipy
    - [x] marking some tests as xfail
- [x] h5py: PR upcoming
- [ ] ELPA
  - "The 'OPTIONAL' attribute must not be specified for the dummy argument 'success' of a procedure that has the procedure language binding specifier", unless `--disable-Fortran2008-features`, but the Fujitsu Compiler is supposed to support it...
  - new configure opt `--enable-FUGAKU` in `2021.005.001`, also `--enable-sve-512`
  - but it still fails
- [x] BerkeleyGW (without ELPA, for now): https://github.com/easybuilders/easybuild-easyconfigs/pull/12868 (needs to be updated)
  - [x] update `berkeleygw` easyblock: https://github.com/easybuilders/easybuild-easyblocks/pull/2428
  - [x] check for fujitsu fftw in a different way (https://github.com/easybuilders/easybuild-easyblocks/pull/2428#discussion_r639502727)
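
The `cstd` item under LLVM above notes that `fcc` only accepts `gnuXX` while `FCC` only accepts `gnu++XX`. A minimal sketch of how a single `cstd` value might be turned into per-compiler `-std=` flags follows. `std_flags_for` is a hypothetical helper, not the code in the linked framework PR, and leaving the flag empty for the compiler the value does not apply to is an assumption.

```python
def std_flags_for(cstd):
    """Map a single 'cstd' toolchain option to (c_flag, cxx_flag).

    Hypothetical helper: a C++ standard such as 'gnu++11' is only passed
    to the C++ compiler (FCC), a C standard such as 'gnu11' only to the
    C compiler (fcc); the other flag is left empty.
    """
    flag = "-std=%s" % cstd
    if "++" in cstd:
        return "", flag
    return flag, ""


print(std_flags_for("gnu++11"))
print(std_flags_for("gnu11"))
```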
Questions about Fujitsu ecosystem
i.e. how the environment will change in the future and if/how it differs across systems
Fugaku
- universality of the `lang/tcsds` modules: are these specific to Fugaku or generic to other Fujitsu a64fx systems?
  - at Fugaku, we are using the `lang` module name (and one of the environment variables it sets, `FJSVXTCLANGA`, although this could be moved to the external module metadata file, using it to set `prefix` and then using `get_software_root` instead), in the toolchain definitions in framework, and as an external module dependency in the `FCC` easyconfig
  - response: "language environment is Fugaku specific, it cannot be used in other Fujitsu machines". So it does seem this is a "Fugaku" toolchain, not a "Fujitsu" toolchain
- permanence of the `lang/tcsds` modules: will they always be available?
  - at Fugaku, old modules that were only present in compute nodes have been removed, not sure if the ones that were also published in login nodes (as is the case of the `tcsds-1.2.31` version that we are using for `4.5.0`/`21.05`) ever will be
  - response: "The language environment (...) is retained for three versions including the latest version". Suggested that older versions are archived instead of deleted, i.e. not immediately visible but still available after some extra step, e.g. `module use ...`. Otherwise, we'll need to remove the version pinning and revert to `FCC-21.05` instead of `FCC-4.5.0`, as a more recent module will change the compiler version...
Isambard
- environment module is `fujitsu-compiler/4.3.1` (after a `module use`), this needs to be changed in the FCC easyconfig and in the toolchain definition...
  - the module doesn't set `FJSVXTCLANGA`, so it needs to be set manually
  - this path is actually all that's needed, so since the environments differ, maybe we should simply rely on a single environment variable?
- large page allocation doesn't seem to be enabled/supported ("libmpg BUG!! ... Assertion '0' failed."), so setting `-Knolargepage` is needed, but again, we need a way of always injecting this without breaking scripts that expect CC to be only the executable...
- the `fujitsu-compiler` module adds the top level include folder to `C_INCLUDE_PATH` and `CPLUS_INCLUDE_PATH`, but that breaks `-Nclang` mode, the wrong headers are included (in particular `arm-sve.h`)
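
One way to always inject a flag such as `-Knolargepage` while keeping `CC` a bare executable name is a small wrapper script placed ahead of the real compiler in `$PATH`. The sketch below only generates the text of such a wrapper. `make_wrapper`, the placeholder path, and the choice of hardcoding the real compiler's absolute path are assumptions for illustration, not necessarily what the draft easyblock does.

```python
import shlex


def make_wrapper(real_compiler, extra_flags):
    """Return the text of a POSIX shell wrapper that execs the real
    compiler with extra_flags placed before the caller's arguments."""
    quoted = " ".join(shlex.quote(flag) for flag in extra_flags)
    return "\n".join([
        "#!/bin/sh",
        "# generated wrapper: injects default flags, forwards everything else",
        'exec %s %s "$@"' % (shlex.quote(real_compiler), quoted),
        "",
    ])


# placeholder path, stands in for wherever the real fcc is installed
script = make_wrapper("/path/to/real/fcc", ["-Knolargepage"])
print(script)
```

Written to an executable file named `fcc` in a directory that comes first in `$PATH`, this keeps `CC=fcc` a single word for build scripts that expect only the executable.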
Update June 2024:
Deucalion
- environment module is `FJSVstclanga/1.0.21.02a` (which simply adds `/opt/FJSVstclanga/cp-1.0.21.02a/{bin,lib64,man}` to `$PATH`, `$LD_LIBRARY_PATH` and `$MANPATH`, plus `UCX_RNDV_THRESH=64k`)
  - the module also doesn't set `FJSVXTCLANGA`, same as Isambard, so only the Fugaku module sets it
  - the root path is actually all that's needed, so since the environments differ, maybe we should simply rely on a single environment variable?
- large page allocation (libmpg) is enabled, same as Fugaku, different from Isambard, so `-Klargepage`, the default, can be used
-Klargepage, the default, can be used - numpy (<1.26) + ssl2 works very well, including multi-threaded, as long as python itself is built with the fujitsu compiler and linked with fjomplib
- using gcccore as a subtoolchain instead of building everything from fcc from scratch also works
- currently exploring trade-offs between the "bottom-up" approach (build everything with fcc, better performance everywhere, but in most cases not by a lot, and a lot more work supporting new versions) vs the "top-down" approach (re-use gcccore (eventually from EESSI?) and only rebuild what really benefits from the fujitsu compiler and libraries)
- possibility of adapting FlexiBLAS to support SSL2, so that even gofbf doesn't need to be rebuilt? (FFTW can easily be overridden with fujitsu's fork)
- OpenMPI vs Fujitsu MPI may not be very relevant at Deucalion, since it has regular Infiniband, not TofuD like Fugaku
- for things that benefit from multithreaded SSL2 called from Python (e.g. numpy/scipy/etc.), one might as well use the "bottom-up" approach, since Python itself is pretty far "down"
- but for everything else, the "top-down" approach is currently looking more promising
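
Across the three systems above only the compiler's root path turned out to be needed, while the module names and the variables they set differ. A minimal sketch of the lookup order floated in the Framework section (environment variable, then metadata file, then `which`) might look like this. `find_fcc_root` and its fallback order are hypothetical; `FJSVXTCLANGA` and the metadata layout come from the notes above, the `/opt/example/lang` path is a placeholder, and treating the metadata `prefix` value as either an absolute path or the name of an environment variable is an assumption.

```python
import configparser
import os
import shutil


def find_fcc_root(env=None, metadata_text=None, module_name=None):
    """Return the compiler root directory, or None if it cannot be found.

    Hypothetical lookup order: $FJSVXTCLANGA, then the 'prefix' entry of an
    external-module metadata section, then the parent of the directory
    holding the 'fcc' executable found in $PATH.
    """
    env = os.environ if env is None else env

    root = env.get("FJSVXTCLANGA")
    if root:
        return root

    if metadata_text and module_name:
        parser = configparser.ConfigParser()
        parser.read_string(metadata_text)
        if parser.has_option(module_name, "prefix"):
            prefix = parser.get(module_name, "prefix")
            # assumption: prefix is an absolute path, or the name of an
            # environment variable that holds the path
            candidate = prefix if os.path.isabs(prefix) else env.get(prefix)
            if candidate:
                return candidate

    fcc = shutil.which("fcc", path=env.get("PATH"))
    if fcc:
        # <root>/bin/fcc -> <root>
        return os.path.dirname(os.path.dirname(fcc))
    return None


# placeholder metadata, same layout as the cfg entry discussed above
metadata = """
[lang/tcsds-1.2.31]
name = lang
version = tcsds-1.2.31
prefix = /opt/example/lang
"""
print(find_fcc_root(env={"PATH": ""}, metadata_text=metadata,
                    module_name="lang/tcsds-1.2.31"))
```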