CI action reproducing the Pangea-3 installation, with ppc64le emulation
New job that:
- emulates a ppc64le architecture (using the `docker/setup-qemu-action` action, which relies on QEMU through the `qemu-user-static` image);
- deploys an AlmaLinux-8 image with prebuilt TPLs and GEOS dependencies installed to match the required pangea3 modules:
  - CMake-3.26
  - gcc-9.4.0
  - ompi-4.1.2
  - cuda-11.5.0
  - openblas-0.3.18
  - lsf-10.1
- adds a `HOST_ARCH` variable to the job matrix to trigger the installation and use of the emulation layer;
- builds GEOS and the unit tests on the `streak-2` self-hosted runner;
- is associated with the TPLs PR 257.
Remark: Unit tests are not run because, due to the emulation layer, the GPUs cannot be used inside docker (the x86_64 drivers coming from the host are not usable inside the ppc64le image). Using them is theoretically possible but not straightforward: if I am not mistaken, at least one GPU would have to be dedicated to the ppc64le image, the suitable drivers installed, and the GPU restarted with those drivers.
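For reference, below is a minimal sketch of how such an emulated job could be declared in a GitHub Actions workflow. The image name, runner label and matrix field names are illustrative assumptions, not necessarily the exact ones used in this PR; only `docker/setup-qemu-action` and the `ppc64le` platform come from the description above.

```yaml
jobs:
  build_and_test:
    strategy:
      matrix:
        include:
          - name: Pangea-3 (AlmaLinux-8, gcc-9.4.0, cuda-11.5.0, emulated)
            # illustrative name for the prebuilt ppc64le TPL image
            DOCKER_REPOSITORY: geosx/pangea3-almalinux8-gcc9.4-cuda11.5.0
            RUNS_ON: streak2-32core      # self-hosted runner (assumed label)
            HOST_ARCH: ppc64le           # presence of this field triggers the QEMU layer
    runs-on: ${{ matrix.RUNS_ON }}
    steps:
      - name: Set up QEMU (qemu-user-static binfmt handlers)
        if: matrix.HOST_ARCH == 'ppc64le'
        uses: docker/setup-qemu-action@v3
        with:
          platforms: ppc64le
      # the remaining steps run the usual containerized GEOS build, now executed
      # through the emulation layer because the TPL image is a ppc64le image
```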
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 56.59%. Comparing base (`fda7ed2`) to head (`cfab88c`). Report is 108 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #3159 +/- ##
========================================
Coverage 56.59% 56.59%
========================================
Files 1064 1064
Lines 89752 89752
========================================
+ Hits 50791 50793 +2
+ Misses 38961 38959 -2
@Algiane Sorry. We updated our TPLs and now there is a linking error on the Pangea3 build.
Remark:
- compilation has been broken since commit 4bef600a0df;
- only the ninja build (linking step) is broken; the make build still works.
A few notes:
- for now it is not possible to use a runner other than `streak2-32cores` because the compilation inside the emulated image is too slow;
- PR #3159 has been (partially) merged into the current PR to solve a deadlock when using ninja to build the project (it allows the use of `Unix Makefiles` instead of ninja). It also reproduces exactly the environment used on pangea (we don't use `ninja` there); a sketch of these workarounds follows this list;
- we build static libraries because the shared library compilation has been broken for a while on powerpc (linking of `libgeos_core` fails due to a `R_PPC_REL24` relocation error).
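As a rough illustration of those two workarounds (the generator fallback and static libraries), a configure/build step inside the emulated container might look like the following; the image variable, mount point and host-config path are assumptions, not the actual CI script.

```yaml
      - name: Configure and build GEOS inside the emulated ppc64le image
        run: |
          docker run --rm --platform linux/ppc64le \
            -v ${GITHUB_WORKSPACE}:/tmp/geos \
            ${{ matrix.DOCKER_REPOSITORY }} \
            bash -c 'cmake -G "Unix Makefiles" -DBUILD_SHARED_LIBS=OFF \
                           -C /tmp/geos/host-configs/environment.cmake \
                           -S /tmp/geos/src -B /tmp/geos/build \
                     && cmake --build /tmp/geos/build -j 32'
```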
I am moving this out of the merge queue because it has conflicts and still needs code owner approval. Hopefully, after merging develop, you will be able to build this with shared libs and the compilation time will go down significantly, because having a 2.5h job occupying the 32-core runner is not really sustainable IMO.
Hi @CusiniM ,
@CusiniM: The shared library compilation has been broken on the powerpc architecture for a while now (due to symbol overflows when linking the library; it was already broken before the geosx->geos PR at the beginning of the month).
@rrsettgast: Once again, the latest modifications of the develop branch have broken the compilation for Pangea3 users. Even if it is easy to fix, it is really annoying.
I understand that reviewing the PR takes time, but in the current situation it is just not possible for us to both contribute to GEOS and focus on our research:
- the `develop` branch compilation has been broken 3 times in one month on P3;
- when it is broken, the erroneous commits are not reverted.

My personal opinion is that it is worth waiting a few hours to check CI success, and so allow other teams to keep working, before merging errors into develop. For people who are not members of LLNL, CI failure or not, we wait for weeks if not months before our work is integrated.
Erratum (I hadn't looked at the new failure before writing):
- the compilation of the geos shared lib itself now works on P3 (commit `d8c31662`): splitting the compilation into smaller parts seems to have solved the symbol overflow we were having;
- the newly introduced error is different and is more likely due to an issue when installing the built shared libraries or in the `LD_LIBRARY_PATH` setting.
Hello @Algiane, what is the latest build error on P3?
I don't think @CusiniM is saying that we shouldn't have this PR merged, or that we don't want this PR merged. We all agree that having CI coverage for P3 is a worthy goal. We are trying to work around the bottleneck this PR will create in our CI workflow. While enabling shared libraries for GEOS in #3282 needed to be done eventually, it was done now specifically to help you get around the linking problems you were seeing, so that we could merge this PR.
The decision of whether or not to integrate this PR is not mine to make: it is between code owners and TotalEnergy managers.
Let me just gather here all the relevant information (we had a lot of private exchanges).
Aim of the PR
The current CI process doesn't test the Pangea3 environment (ppc64le architecture, RedHat-8 operating system, CPU and GPU compiler versions, pangea3 host-config file):
- this environment is used by the researchers of the Inria-TotalEnergy team (~10 researchers) and possibly other people located in Pau working with or collaborating with TotalEnergy;
- currently a set of unit tests are failing on this environment and still have to be fixed (a few in `Release` mode, more in `Debug`). It seems to be a recurring problem (#2031, #2985, #2552);
- the compilation of the project is regularly broken (#3128, #3243, #3305 for recent examples);
- on pangea3, on 32 cores, the TPL + GEOS build takes about 3h (variable depending on machine workload, compilation options, etc.);
- understanding and fixing bugs after their integration requires developer/engineering skills that PhD students, non-permanent staff or researchers do not necessarily have;
- most of the time, in the rush, several people end up chasing the same bug at the same time: it is a waste of time, money and human skills.
Today it is the main bottleneck of the Inria-TotalEnergy team.
Solutions that have been investigated
The main reproduction difficulty is the ppc64le architecture, so I will focus on this specific issue.
Access to a power-pc node
- at TotalE: not possible: self-hosted runners are not authorized to be triggered on the infrastructure due to cybersecurity concerns, so they cannot be used with the GEOS repo;
- at Inria / on national clusters: not possible: the CI demand is too great for the resources at our disposal;
- on the cloud: I did a very quick search and couldn't find native powerpc hours, but it could be worth investigating a little more if a recurring bill for cloud hours is acceptable;
Cross compilation
Dead end: we would not be able to run the binaries we produce, as the emulation is not easily compatible with GPU use inside a docker image (it would also be very hard, if even possible, to get the entire project built this way, and a nightmare to maintain).
Emulation of ppc64le architecture
It is the only solution that has been proven to be feasible and maintainable.
The main features are described in the initial PR description (https://github.com/GEOS-DEV/GEOS/pull/3159#issue-2333064608). AlmaLinux-8 is a 1:1 binary-compatible copy of RedHat and, since the job has been ready, the PR has highlighted all the regressions encountered on pangea3.
The docker image built by the TPL CI can directly be used to work in the P3 environment, either on a ppc arch, or on another arch with emulation.
The main drawback is the compilation slowdown: it was evaluated and commented on in February here: https://github.com/GEOS-DEV/thirdPartyLibs/pull/257#issuecomment-1964726949. You can expect a slowdown by a factor of about 14.
Another possible issue is a CI failure while the actual pangea3 build would succeed, due to weird emulation errors (see PR #3276: the error when linking the project with ninja seems to occur only with the emulation layer). In my experience this has only happened once in the last few months, but I don't have the hindsight to know whether it will be a frequent problem. My proposal was to help find and solve this kind of bug, and to remove the pangea3 job if emulation is found to be unstable.
Slowdown issue and possible future CI bottleneck
For now we know that we have a development bottleneck that wastes developer time. It can be solved, but the solution will perhaps (probably) create a CI bottleneck: I consider human time more valuable than machine time, but that is a personal opinion.
Comparative build time
Job configuration and duration (without cache):
| Job config | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster |
|---|---|---|---|
| Runner | streak-1 | streak2-32core | |
| ncores used | 16 | 32 | 32 |
| emulation | no | yes | no |

| Job length (no cache) | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster |
|---|---|---|---|
| geosx build | 1h10 | 1h43 | 32m49 |
| Unit tests build | 13m | 32m29 | 8m35 |
| Total build (+install) time | 1h26 | 2h20 | |
| geosx build with wave solver only | | 55m | 14m9 |
Possible solutions for the dev bottleneck and develop instabilities
- much more rigorous PR reviews. Some errors that were merged recently should not have passed review; very probably, as these were very large PRs, not all the files were examined.
- PRs often contain unrelated modifications. This may be acceptable for very small PRs, but not for PRs that modify 50 or more files.
Possible solutions for the ci bottleneck
- take advantage of the code modularity and build only the `geos` binary and the `WaveSolver` solver;
- do not run the `integrated tests` job (which also runs on `streak2`) when it is not needed;
- set job dependencies such that the Pangea3 job is triggered only when all other jobs succeed (see the sketch after this list);
- add a specific flag to manually trigger this job when everything else is OK;
- in both cases, the `all_job_succeed` job has to fail without the P3 job;
- the purchase of a machine dedicated to this job, hosted and managed at Inria, is under discussion between managers.
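A minimal sketch of the gating idea, assuming hypothetical regular-job names (`cpu_builds`, `cuda_builds`) and an opt-in label; the actual GEOS workflow job and label names may differ:

```yaml
  pangea3_build:
    # start only after the regular (non-emulated) jobs have succeeded...
    needs: [cpu_builds, cuda_builds]
    # ...and only when the PR opts in through a (hypothetical) label
    if: contains(github.event.pull_request.labels.*.name, 'ci: run Pangea-3')
    runs-on: streak2-32core
    steps:
      - name: Build GEOS (emulated)
        run: echo "emulated build steps go here"

  all_job_succeed:
    # the final gate lists the Pangea-3 job so that it cannot be silently skipped
    needs: [cpu_builds, cuda_builds, pangea3_build]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Fail unless every required job succeeded
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') || contains(needs.*.result, 'skipped')
        run: exit 1
```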
Status of the PR
It has been ready to be integrated since June (with no conflicts with the develop branch): I have worked to maintain it, integrate the new developments (TPLs + GEOS), fix the introduced bugs and resolve the introduced conflicts (I have spent more time doing that than developing the PR itself).
Concretely, the PR amounts to ~10 lines that are pretty easy to understand. Everything is documented in the PR description.
The job I was asked to do is done, so I'll leave it to you to resolve conflicts until you make up your mind and decide either to reject or to integrate this work.
Access to a power-pc node
GCP seems to be listening to the market: https://cloud.google.com/blog/products/compute/ibm-power-systems-now-available-on-google-cloud. I don't think you can get access to GPU nodes, but it may be worth asking for pricing.
Post https://github.com/GEOS-DEV/GEOS/pull/3159#issuecomment-2317222372 edited on September 12th to add timing info and new ideas in the "solutions" section.
@Algiane @rrsettgast I think there is a good point here. It's not the first time that the computational expense has been highlighted. Maybe go a bit further with sccache and reorganize the workflows as proposed in https://github.com/GEOS-DEV/GEOS/issues/2579. https://github.com/GEOS-DEV/GEOS/pull/3266 is a first step of that reorganization.
@Algiane @sframba Now that we have merged #3340 , does this PR contain anything that you think needs to be merged?
No, sorry for the oversight