CI action reproducing the Pangea-3 installation, with ppc64le emulation
New job that:
- emulates a ppc64le architecture (using the `docker/setup-qemu-action` action, which relies on QEMU through the `qemu-user-static` image);
- deploys an AlmaLinux-8 image with prebuilt TPLs and GEOS dependencies installed to match the required pangea3 modules:
  - CMake-3.26
  - gcc-9.4.0
  - ompi-4.1.2
  - cuda-11.5.0
  - openblas-0.3.18
  - lsf-10.1
- adds a `HOST_ARCH` variable to the job matrix to trigger the installation and use of the emulation layer;
- builds GEOS and the unit tests on the `streak-2` self-hosted runner;
- is associated with the TPLs PR 257.
Remark: Unit tests are not run because, due to the emulation layer, the GPUs cannot be used inside docker (the x86_64 drivers coming from the host are not usable inside the ppc64le image). Using them is theoretically possible but not straightforward: if I am not mistaken, at least one GPU would have to be dedicated to the ppc64le image, the suitable drivers installed, and the GPU restarted with those drivers.
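For reference, below is a minimal sketch of how such an emulated job could be declared in a GitHub Actions workflow. The image name, runner label and matrix field names are illustrative assumptions, not necessarily the exact ones used in this PR; only `docker/setup-qemu-action` and the `ppc64le` platform come from the description above.

```yaml
jobs:
  build_and_test:
    strategy:
      matrix:
        include:
          - name: Pangea-3 (AlmaLinux-8, gcc-9.4.0, cuda-11.5.0, emulated)
            # illustrative name for the prebuilt ppc64le TPL image
            DOCKER_REPOSITORY: geosx/pangea3-almalinux8-gcc9.4-cuda11.5.0
            RUNS_ON: streak2-32core      # self-hosted runner (assumed label)
            HOST_ARCH: ppc64le           # presence of this field triggers the QEMU layer
    runs-on: ${{ matrix.RUNS_ON }}
    steps:
      - name: Set up QEMU (qemu-user-static binfmt handlers)
        if: matrix.HOST_ARCH == 'ppc64le'
        uses: docker/setup-qemu-action@v3
        with:
          platforms: ppc64le
      # the remaining steps run the usual containerized GEOS build, now executed
      # through the emulation layer because the TPL image is a ppc64le image
```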
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 56.59%. Comparing base (`fda7ed2`) to head (`cfab88c`). Report is 108 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #3159 +/- ##
========================================
Coverage 56.59% 56.59%
========================================
Files 1064 1064
Lines 89752 89752
========================================
+ Hits 50791 50793 +2
+ Misses 38961 38959 -2
@Algiane Sorry. We updated our TPLs and now there is a linking error on the Pangea3 build.
Remark:
- compilation has been broken since commit 4bef600a0df;
- only the ninja build (linking step) is broken; the make build still works.
A few notes:
- for now it is not possible to use a runner other than `streak2-32cores` because the compilation inside the emulated image is too slow;
- PR #3159 has been (partially) merged into the current PR to solve a deadlock when using ninja to build the project (it allows the use of `Unix Makefiles` instead of ninja). It also reproduces exactly the environment used on pangea (we don't use `ninja` there); a sketch of these workarounds follows this list;
- we build static libraries because the shared library compilation has been broken for a while on powerpc (linking of `libgeos_core` fails due to a `R_PPC_REL24` relocation error).
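As a rough illustration of those two workarounds (the generator fallback and static libraries), a configure/build step inside the emulated container might look like the following; the image variable, mount point and host-config path are assumptions, not the actual CI script.

```yaml
      - name: Configure and build GEOS inside the emulated ppc64le image
        run: |
          docker run --rm --platform linux/ppc64le \
            -v ${GITHUB_WORKSPACE}:/tmp/geos \
            ${{ matrix.DOCKER_REPOSITORY }} \
            bash -c 'cmake -G "Unix Makefiles" -DBUILD_SHARED_LIBS=OFF \
                           -C /tmp/geos/host-configs/environment.cmake \
                           -S /tmp/geos/src -B /tmp/geos/build \
                     && cmake --build /tmp/geos/build -j 32'
```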
I am moving this out of the merge queue because it has conflicts and still needs code owner approval. Hopefully, after merging develop, you will be able to build this with shared libs and the compilation time will go down significantly, because having a 2.5h job occupying the 32-core runner is not really sustainable IMO.
Hi @CusiniM ,
@CusiniM: The shared library compilation has been broken on the powerpc architecture for a while now (due to symbol overflows when linking the library; it was already broken before the geosx->geos PR at the beginning of the month).
@rrsettgast: Once again, the latest modifications of the develop branch have broken the compilation for Pangea3 users. Even if it is easy to fix, it is really annoying.
I understand that reviewing the PR takes time, but in the current situation it is just not possible for us to both contribute to GEOS and focus on our research:
- the `develop` branch compilation has been broken 3 times in one month on P3;
- when it is broken, the erroneous commits are not reverted.

My personal opinion is that it is worth waiting a few hours to check CI success, and so allow other teams to keep working, before merging errors into develop. For people who are not members of LLNL, CI failure or not, we wait for weeks if not months before our work is integrated.
Erratum (I hadn't looked at the new failure before writing):
- the compilation of the geos shared lib itself now works on P3 (commit `d8c31662`): splitting the compilation into smaller parts seems to have solved the symbol overflow we were having;
- the newly introduced error is different and is more likely due to an issue when installing the built shared libraries or in the `LD_LIBRARY_PATH` setting.
Hello @Algiane, what is the latest build error on P3?
I don't think @CusiniM is saying that we shouldn't have this PR merged, or that we don't want this PR merged. We all agree that having CI coverage for P3 is a worthy goal. We are trying to work around the bottleneck this PR will create in our CI workflow. While enabling shared libraries for GEOS in #3282 needed to be done eventually, it was done now specifically to help you get around the linking problems you were seeing, so that we could merge this PR.
The decision of whether or not to integrate this PR is not mine to make: it is between code owners and TotalEnergy managers.
Let me just gather here all the relevant information (we had a lot of private exchanges).
Aim of the PR
The current CI process doesn't test the Pangea3 environment (ppc64le architecture, RedHat-8 operating system, CPU and GPU compiler versions, pangea3 host-config file):
- this environment is used by the researchers of the Inria-TotalEnergy team (~10 researchers) and possibly other people located in Pau working with or collaborating with TotalEnergy;
- currently a set of unit tests are failing on this environment and still have to be fixed (a few in `Release` mode, more in `Debug`). It seems to be a recurring problem (#2031, #2985, #2552);
- the compilation of the project is regularly broken (#3128, #3243, #3305 for recent examples);
- on pangea3, on 32 cores, the TPL + GEOS build takes about 3h (variable depending on machine workload, compilation options, etc.);
- understanding and fixing bugs after their integration requires developer/engineering skills that PhD students, non-permanent staff or researchers do not necessarily have;
- most of the time, in the rush, several people end up chasing the same bug at the same time: it is a waste of time, money and human skills.
Today it is the main bottleneck of the Inria-TotalEnergy team.
Solutions that have been investigated
The main reproduction difficulty is the ppc64le architecture, so I will focus on this specific issue.
Access to a power-pc node
- at TotalE: not possible: self-hosted runners are not authorized to be triggered on the infrastructure due to cybersecurity concerns, so they cannot be used with the GEOS repo;
- at Inria / on national clusters: not possible: the CI demand is too great for the resources at our disposal;
- on the cloud: I did a very quick search and couldn't find native powerpc hours, but it could be worth investigating a little more if a recurring bill for cloud hours is acceptable;
Cross compilation
Dead end: we would not be able to run the binaries we produce, as the emulation is not easily compatible with GPU use inside a docker image (it would also be very hard, if even possible, to get the entire project built this way, and a nightmare to maintain).
Emulation of ppc64le architecture
It is the only solution that has been proven to be feasible and maintainable.
The main features are described in the initial PR description (https://github.com/GEOS-DEV/GEOS/pull/3159#issue-2333064608). AlmaLinux-8 is a 1:1 binary-compatible copy of RedHat and, since the job has been ready, the PR has highlighted all the regressions encountered on pangea3.
The docker image built by the TPL CI can directly be used to work in the P3 environment, either on a ppc arch, or on another arch with emulation.
The main drawback is the compilation slowdown: it was evaluated and commented on in February here: https://github.com/GEOS-DEV/thirdPartyLibs/pull/257#issuecomment-1964726949. You can expect a slowdown by a factor of about 14.
Another possible issue is a CI failure while the actual pangea3 build would succeed, due to weird emulation errors (see PR #3276: the error when linking the project with ninja seems to occur only with the emulation layer). In my experience this has only happened once in the last few months, but I don't have the hindsight to know whether it will be a frequent problem. My proposal was to help find and solve this kind of bug, and to remove the pangea3 job if emulation is found to be unstable.
Slowdown issue and possible future CI bottleneck
For now we know that we have a development bottleneck that wastes developer time. It can be solved, but the solution will perhaps (probably) create a CI bottleneck: I consider human time more valuable than machine time, but that is a personal opinion.
Comparative build time
Job configuration and duration (without cache):
| Job config | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster |
|---|---|---|---|
| Runner | streak-1 | streak2-32core | |
| ncores used | 16 | 32 | 32 |
| emulation | no | yes | no |

| Job length (no cache) | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster |
|---|---|---|---|
| geosx build | 1h10 | 1h43 | 32m49 |
| Unit tests build | 13m | 32m29 | 8m35 |
| Total build (+install) time | 1h26 | 2h20 | |
| geosx build with wave solver only | | 55m | 14m9 |
Possible solutions for the dev bottleneck and develop instabilities
- much more rigorous PR reviews. Some errors that were merged recently should not have passed review; very probably, as these were very large PRs, not all the files were examined.
- PRs often contain unrelated modifications. This may be acceptable for very small PRs, but not for PRs that modify 50 or more files.
Possible solutions for the ci bottleneck
- take advantage of the code modularity and build only the `geos` binary and the `WaveSolver` solver;
- do not run the `integrated tests` job (which also runs on `streak2`) when it is not needed;
- set job dependencies such that the Pangea3 job is triggered only when all other jobs succeed (see the sketch after this list);
- add a specific flag to manually trigger this job when everything else is OK;
- in both cases, the `all_job_succeed` job has to fail without the P3 job;
- the purchase of a machine dedicated to this job, hosted and managed at Inria, is under discussion between managers.
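A minimal sketch of the gating idea, assuming hypothetical regular-job names (`cpu_builds`, `cuda_builds`) and an opt-in label; the actual GEOS workflow job and label names may differ:

```yaml
  pangea3_build:
    # start only after the regular (non-emulated) jobs have succeeded...
    needs: [cpu_builds, cuda_builds]
    # ...and only when the PR opts in through a (hypothetical) label
    if: contains(github.event.pull_request.labels.*.name, 'ci: run Pangea-3')
    runs-on: streak2-32core
    steps:
      - name: Build GEOS (emulated)
        run: echo "emulated build steps go here"

  all_job_succeed:
    # the final gate lists the Pangea-3 job so that it cannot be silently skipped
    needs: [cpu_builds, cuda_builds, pangea3_build]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Fail unless every required job succeeded
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') || contains(needs.*.result, 'skipped')
        run: exit 1
```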
Status of the PR
It has been ready to be integrated since June (with no conflicts with the develop branch): I have worked to maintain it, integrate the new developments (TPLs + GEOS), fix the introduced bugs and resolve the introduced conflicts (I have spent more time doing that than developing the PR itself).
Concretely, the PR amounts to ~10 lines that are pretty easy to understand. Everything is documented in the PR description.
The job I was asked to do is done, so I'll leave it to you to resolve conflicts until you make up your mind and decide either to reject or to integrate this work.
Access to a power-pc node
GCP seems to be listening to the market: https://cloud.google.com/blog/products/compute/ibm-power-systems-now-available-on-google-cloud. I don't think you can get access to GPU nodes, but it may be worth asking for pricing.
Post https://github.com/GEOS-DEV/GEOS/pull/3159#issuecomment-2317222372 edited on September 12th to add timing info and new ideas in the "solutions" section.
@Algiane @rrsettgast I think there is a good point here. It's not the first time that the computational expense has been highlighted. Maybe go a bit further with sccache and reorganize the workflows as proposed in https://github.com/GEOS-DEV/GEOS/issues/2579. https://github.com/GEOS-DEV/GEOS/pull/3266 is a first step of that reorganization.
@Algiane @sframba Now that we have merged #3340 , does this PR contain anything that you think needs to be merged?
No, sorry for the oversight