
Notes on building the stack for Fujitsu A64FX

Open trz42 opened this issue 9 months ago • 3 comments

The stack has been under construction since summer 2024. Recently (April 2025) we put a shared bot instance on Deucalion into production. We'll use this issue to coordinate building the stack.


Short summary of current status

2025-04-11: In total 271 modules have been built with the toolchains foss/2023a and foss/2023b.

Remember some TODOs

  • [ ] Rebuild SitePackage.lua with changes done in https://github.com/EESSI/software-layer/pull/1022 (in that PR building for a64fx failed)

System toolchain & miscellaneous improvements

  • [x] #1025

trz42 avatar Apr 12 '25 21:04 trz42

Comment for installation of toolchain foss/2023a and all packages built with it.

To avoid rebuilding software within the 2023a stack, we have to be careful to use from-commit while building:

SciPy-bundle-2023.07-gfbf-2023a
R-bundle-CRAN-2023.12-foss-2023a
R-bundle-Bioconductor-3.18-foss-2023a-R-4.3.2
  • if these apps are already ingested, we need to figure out if we have to rebuild them
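
A minimal sketch of how one might check, before submitting a build, that the rebuild-sensitive easyconfigs listed above carry a commit pin. It assumes the easystack has already been parsed into Python data shaped like the easystack entries used in this issue (a name mapped to an `options` dictionary). The helper name `missing_pin`, the sample entries, and the `<commit-sha>` placeholder are hypothetical.

```python
# Hypothetical helper; names and sample data are illustrative only.
SENSITIVE = (
    "SciPy-bundle-2023.07-gfbf-2023a",
    "R-bundle-CRAN-2023.12-foss-2023a",
    "R-bundle-Bioconductor-3.18-foss-2023a-R-4.3.2",
)
# Options that pin an easyconfig or easyblock to a specific commit.
PIN_OPTIONS = ("from-commit", "include-easyblocks-from-commit")


def strip_eb(name):
    """Drop a trailing '.eb' so names compare equal with or without it."""
    return name[:-3] if name.endswith(".eb") else name


def missing_pin(entries, sensitive=SENSITIVE):
    """Return sensitive easyconfig names that lack a commit-pinning option.

    `entries` mimics a parsed easystack list: each item is either a plain
    string or a one-key dict {name: {"options": {...}}}.
    """
    pinned = set()
    for entry in entries:
        if isinstance(entry, dict):
            for name, spec in entry.items():
                options = (spec or {}).get("options") or {}
                if any(opt in options for opt in PIN_OPTIONS):
                    pinned.add(strip_eb(name))
    return [name for name in sensitive if name not in pinned]


# Sample input: one pinned entry (placeholder commit), one unpinned entry.
entries = [
    {"SciPy-bundle-2023.07-gfbf-2023a.eb": {"options": {"from-commit": "<commit-sha>"}}},
    "R-bundle-CRAN-2023.12-foss-2023a.eb",
]
print(missing_pin(entries))
```

Entries without an `options` block are treated as unpinned, so the check errs on the side of flagging a package rather than silently letting a rebuild through.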

Apps from EB 4.8.2 2023a easystack

  • [x] ALREADY in /cvmfs: add GCCcore/12.3.0 (grace PR: #979; sapphirerapids PR: #925)
  • [x] ALREADY in /cvmfs: add foss/2023a (grace PR: #983; sapphirerapids PR: #925)
  • [ ] MOST apps are in /cvmfs
    • [x] remaining apps except SciPy-bundle rebuild
      • [x] split up into several:
        • [x] first Cython and TensorFlow dependencies: #1028
        • [x] all remaining apps except TensorFlow: #1033
        • [x] TensorFlow: #1034 (par=2)
          • first build interactively to figure out what eb parameters let it build successfully
            • 27 extensions need to be built: grpcio (8/27) requires quite a bit of time (1h33 (par=2) - 1h36 (par=8) in different runs)
      • grace PRs
        • #985
        • #986
      • sapphirerapids PR: #929
    • check if SciPy-bundle 2023.07 needs to be rebuilt -> separate PR
      • depends on Cython-3.0.8-GCCcore-12.3.0.eb
      • [ ] #1027
        • fakeroot is currently not configured for the build account => need to follow up later
    • Boost-1.82.0 is already in the stack, only the source URL had changed, so no need to do anything with it
    • What about Boost.Python? -> it doesn't seem to be built via EB 4.8.2

Apps from EB 4.9.0 2023a easystack

  • [x] #1038
    • grace PR: #990
    • sapphirerapids PR: #932
  • [x] #1040
    • grace PR: #991 + #995
    • sapphirerapids PR: #934
      • added from-commit to libxc-6.2.2
      • added from-commit to ParMETIS-4.0.3
  • [x] #1086

Apps from EB 4.9.1 2023a easystack

  • [x] done in several parts
    • [x] part 1: #1049
    • [x] part 2: #1060
    • grace PR: #998, #1007
    • sapphirerapids PR: #941, #943, #944

Apps from EB 4.9.2 2023a easystack

  • [ ] #1091
    • grace PR: #1002
    • sapphire rapids PRs: #942, #945, #954

trz42 avatar Apr 14 '25 11:04 trz42

Comment for installation of toolchain foss/2023b and all packages built with it.

Need to be careful when building the following packages because rebuilding doesn't work on Deucalion currently:

  • scikit-build-core-0.9.3-GCCcore-13.2.0.eb (see #957)

Apps from EB 4.9.0 2023b easystack

  • [x] add foss/2023b (grace PR: #982, sapphirerapids PR: #926)
  • [x] build in two parts
    • [x] part 1 includes most apps except Qt5: #1050
    • [x] part 2 includes Qt5: #1061
    • grace PR: #978
    • sapphirerapids PR: #931

Apps from EB 4.9.1 2023b easystack

  • [x] #1097
    • used from-commit for scikit-build-core-0.9.3-GCCcore-13.2.0.eb to avoid need for rebuild (see #957)
    • grace PR: #987
    • sapphirerapids PR: #936
  • [ ] GROMACS (skipped in the above because some of its tests failed)

Apps from EB 4.9.2 2023b easystack

  • [ ] apps originally built with EB 4.9.2 (grace PR: #989, sapphirerapids PR: #939)

Apps from EB 4.9.3 2023b easystack

  • [ ] apps originally built with EB 4.9.3 (grace PR: #992, sapphirerapids PR: #946)

Apps from EB 4.9.4 2023b easystack

  • [ ] a couple of apps built with EB 4.9.4 (grace PR: #993, sapphirerapids PR: #947)
  • [ ] a few more apps built with EB 4.9.4 (grace PR: #996, no sapphirerapids PR)

trz42 avatar Apr 20 '25 10:04 trz42

Comment for installation of toolchain foss/2022b and all packages built with it.

NVIDIA Grace // Sapphire Rapids PRs to consider (incl rebuilds):

  • #1003 // #923
    • includes Python-3.10.8-GCCcore-12.2.0-bare: check if EB 4.9.4 includes updated easyblock from PR 3352
  • #1003, #1005 // #927
    • includes Python-3.10.8-GCCcore-12.2.0: check if EB 4.9.4 includes updated easyblock from PR 3352
  • #1013, #1022 // #930
    • probably need to use from-commit for 2 Boost easyconfigs
  • #1026 // #935
  • #938
    • make sure that we don't need to rebuild
    - Python-3.10.8-GCCcore-12.2.0-bare:
        options:
          # See https://github.com/easybuilders/easybuild-easyblocks/pull/3352
          include-easyblocks-from-commit: 1ee17c0f7726c69e97442f53c65c5f041d65c94f
    - Python-3.10.8-GCCcore-12.2.0:
        options:
          # See https://github.com/easybuilders/easybuild-easyblocks/pull/3352
          include-easyblocks-from-commit: 1ee17c0f7726c69e97442f53c65c5f041d65c94f
    
  • #1029, #1031 // #940

Need to be careful when building the following packages because rebuilding doesn't work on Deucalion currently:

  • Python-3.10.8-GCCcore-12.2.0-bare (see #938)
  • Python-3.10.8-GCCcore-12.2.0 (see #938)

Apps originally built with EB 4.8.2

  • [x] #1051
    • grace PR: #1003
    • sapphirerapids PRs: #923, #927
  • [ ] https://github.com/EESSI/software-layer/pull/1098
    • grace PR: #1003
    • sapphirerapids PRs: #927

Apps originally built with EB 4.8.2, and also built here with EB 4.8.2

  • [ ] Qt5 5.15.7
    • grace PR: #1005
    • sapphirerapids PR: #927
  • [ ] QuantumESPRESSO 7.2
    • grace PR: #1008
    • sapphirerapids PRs: #927

Apps originally built with EB 4.9.0

  • [ ] a bunch of apps (grace PR: #1013, sapphirerapids PR: #930)
    • [ ] update for SitePackage.lua so module load SciPy-bundle/2023.02-gfbf-2022b (included in PR 1013) prints a message about an unusually high number of failing unit tests (general PR #1022); the PR also fixes ReFrame test issues on Grace nodes

Apps originally built with EB 4.9.1

  • [ ] another set of apps (grace PR: #1026, sapphirerapids PR: #935)

Apps originally built with EB 4.9.2

  • [ ] part 1: 47 of 70 apps (grace PR: #1029; sapphirerapids PR: #940)
  • [ ] part 2: 23 of 70 apps (grace PR: #1031; sapphirerapids PR: #940)

trz42 avatar Apr 27 '25 17:04 trz42

Adding BWA as part of initial test https://github.com/EESSI/software-layer/pull/1150

dagonzalezfo avatar Aug 11 '25 15:08 dagonzalezfo

About missing packages described here: https://github.com/EESSI/software-layer/issues/1024#issuecomment-2817106827

  • [ ] EB 4.9.2: https://github.com/EESSI/software-layer/pull/1151
  • [x] EB 4.9.2: https://github.com/EESSI/software-layer/pull/1176
  • [x] EB 4.9.3: https://github.com/EESSI/software-layer/pull/1152
  • [x] EB 4.9.4: https://github.com/EESSI/software-layer/pull/1153
  • [x] EB 4.9.4: https://github.com/EESSI/software-layer/pull/1154

dagonzalezfo avatar Aug 14 '25 12:08 dagonzalezfo

Only remaining:

  • [x] Batch of 4.9.0, numpy tests fail: https://github.com/EESSI/software-layer/pull/1187
  • [ ] GROMACS, tests failing: https://github.com/EESSI/software-layer/pull/1155
  • [x] Qt5 fails because of LLVM: https://github.com/EESSI/software-layer/pull/1177
  • [x] old-ish one, fails for unknown reason: https://github.com/EESSI/software-layer/pull/1091

hvelab avatar Sep 29 '25 08:09 hvelab

For that last one: I checked the logs of the last two build jobs, and both just stopped somewhere in the middle. Slurm reports NODE_FAIL for both jobs though, so maybe they were taking up too many resources (disk space?). We might have to split that PR into smaller chunks.
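
Splitting a large easystack into smaller build batches could look roughly like the sketch below. The function name `split_into_batches`, the batch size, and the placeholder easyconfig names are assumptions for illustration and are not taken from the PR.

```python
def split_into_batches(easyconfigs, batch_size):
    """Split a list of easyconfig names into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        easyconfigs[start:start + batch_size]
        for start in range(0, len(easyconfigs), batch_size)
    ]


# Placeholder names; a real easystack would list actual easyconfig files.
apps = [f"app-{i}.eb" for i in range(1, 8)]
batches = split_into_batches(apps, batch_size=3)
for number, batch in enumerate(batches, start=1):
    print(f"part {number}: {len(batch)} easyconfigs")
```

Each batch would then go into its own PR, so that a single job needs less disk space and wall time on the build node.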

bedroge avatar Sep 29 '25 09:09 bedroge

Just to log it here: I've just removed ParMETIS/4.0.3-gompi-2023b from the A64FX stack. This was installed, sort of, by accident in https://github.com/EESSI/software-layer/pull/1118: it was filtered for other CPU targets, but A64FX had an outdated EESSI-extend and because of that it was built as dependency of openCARP. The latter has been rebuilt in https://github.com/EESSI/software-layer/pull/1223 (also its dependencies PETSc and SuperLU_dist) to make sure they don't depend on ParMETIS anymore. Thus, it could be safely removed now, and this ensures that all stacks are similar again w.r.t. the ParMETIS situation.

bedroge avatar Oct 09 '25 18:10 bedroge

The a64fx stack is now on par with the others, except for the following three GROMACS versions:

  • GROMACS/2024.3-foss-2023b
  • GROMACS/2024.4-foss-2023b
  • GROMACS/2024.1-foss-2023b

These are failing due to a failing test; an issue has been opened by @dagonzalezfo: https://gitlab.com/gromacs/gromacs/-/issues/5440.
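
The comparison behind a statement like this can be sketched as a set difference over module names. This is only an illustration: `missing_modules` is a hypothetical helper, and the module lists are a tiny hand-picked subset rather than the real contents of any stack.

```python
def missing_modules(reference, target):
    """Return modules present in `reference` but absent from `target`, sorted."""
    return sorted(set(reference) - set(target))


# Illustrative subset only; real stacks contain hundreds of modules.
reference_stack = [
    "foss/2023a",
    "foss/2023b",
    "GROMACS/2024.1-foss-2023b",
    "GROMACS/2024.3-foss-2023b",
    "GROMACS/2024.4-foss-2023b",
]
a64fx_stack = ["foss/2023a", "foss/2023b"]

for module in missing_modules(reference_stack, a64fx_stack):
    print(module)
```

An empty result would mean the target stack provides every module of the reference stack.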

bedroge avatar Oct 21 '25 12:10 bedroge

The last missing application (GROMACS 2024.1) was added in https://github.com/EESSI/software-layer/pull/1308. This allowed the CI that compares stacks to pass, see https://github.com/EESSI/software-layer/pull/1221 🎉

So, A64FX is fully on par with the other stacks now, and this has been reflected in the docs: https://github.com/EESSI/docs/pull/625.

bedroge avatar Nov 18 '25 09:11 bedroge