Scalability issue in decompInit_lnd
In the decompInit_lnd routine (in src/main/decompInitMod.F90), there's a performance issue in the section below:
https://github.com/ESCOMP/CTSM/blob/7ff60613ddb8235396e0098b67bf0dfa79195e28/src/main/decompInitMod.F90#L226-L236
The problem is that the main loop enclosing that section runs over every cell, while this internal loop runs over the number of clumps, which is at least equal to the number of PEs. For km-scale runs, that gets quite large; e.g., on the 3.75 km test case the outer loop covers ~42M cells and the inner loop is over at least ~40K clumps, since that's the minimum number of PEs we're able to run the case on. The conditional does restrict the inner loop to only ~12.4M of the 42M cells, but it's still a problem to have the cost scale with the number of cores.
In terms of data, the loop above took between 754 and 1044 seconds, averaging 804 seconds across all PEs. That's ~86% of the InitializeRealize call.
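For anyone skimming, the shape of the problem is roughly the following (a schematic only, not the actual CTSM code, and with the sizes scaled way down):

```fortran
! Schematic of the scaling problem only -- NOT the actual CTSM code, and the
! sizes are scaled way down. The point is that the work grows as
! ncells * nclumps, and nclumps grows with the number of PEs.
program nested_loop_schematic
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: ncells  = 42000   ! stand-in for ~42M cells at 3.75 km
  integer, parameter :: nclumps = 400     ! stand-in for the >40K clumps (>= #PEs)
  integer :: an, cid
  integer(int64) :: inner_iterations

  inner_iterations = 0
  do an = 1, ncells                       ! outer loop: every grid cell
     if (is_active(an)) then              ! conditional: only active (land) cells
        do cid = 1, nclumps               ! inner loop: every clump, on every rank
           inner_iterations = inner_iterations + 1
        end do
     end if
  end do
  print *, 'inner-loop iterations:', inner_iterations

contains

  logical function is_active(an)
    integer, intent(in) :: an
    is_active = (mod(an, 10) < 3)         ! stand-in for the ~30% active land mask
  end function is_active

end program nested_loop_schematic
```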
I've got a few ideas on things to try, including simply changing the complex conditionals as a temporary work-around, as well as saving/reading a decomposition, but would welcome insights from folks who know the land model.
Perhaps the other issue here, which may be more important in the end, is that it seems like 'clumps' is allocated on every rank, with entries for every rank?
https://github.com/ESCOMP/CTSM/blob/7ff60613ddb8235396e0098b67bf0dfa79195e28/src/main/decompInitMod.F90#L118
Again, I welcome input here - this is a challenge for memory scalability, at least, and unless this is needed elsewhere, we should move to a local-to-the-rank structure.
Anyway, just getting the issue in - I think I can create work-arounds for the near-term needs, but would be happy to chat with any land folks on this and see if we can get a SIF or some other way to focus on addressing this soon, too.
Thanks!
Definition of done:
- [ ] Communicate with John about what we need to do
- [x] Erik learns about MPI_Scan from the article
- [ ] Possibly meet with John if we need to figure things out
- [ ] Start adding self-tests and self-test namelist control to test decomposition
- [ ] Add a unit test for decompInit?
- [ ] Can I add unit tests that run in parallel?
- [ ] Should we also have the ability to save the decomposition?
- [ ] Implement the ability to save the decomposition if desired
- [ ] Start implementing the solution on a branch
- [ ] Figure out appropriate testing to do for it (is there a way to do this in a unit test?)
- [ ] Do the needed testing
- [ ] Make sure the MPI_SCAN works in mpi-serial
- [ ] Upgrade to a PR and bring to CTSM master
Thanks for creating this issue, @briandobbins. Let us know the timeline that's helpful for this to be addressed. As you know, we're kind of slammed with prepping for CLM6 / CESM3, so addressing this after the release will be more realistic. That said, we don't want poor scalability hindering the high-res work you're trying to do.
Great sleuthing, @briandobbins ! See also the doubly nested loop over clumps in decompInit_clumps, which should probably also be addressed similarly (though it's not as bad as ngridcells*nclumps).
@briandobbins do you need any of us to do anything with this now?
Not yet - this isn't critical for CESM3; I'll likely try to target this under some of the km-scale efforts in the near future, but wanted to make the issue so I don't forget it. Thanks!
@briandobbins, I think @ekluzek will have time to start on the simple things here over the next month. Maybe you can help coordinate between John Dennis (please add his GitHub handle to this thread) and Erik to determine what this may look like. I'm less sure that Erik will be able to devote much time to the testing of this first phase during Sprint 18 (basically May).
@ekluzek can you create some child issues for these steps in improving our initialization?
Sure, sounds great -- I think from prior conversations with John, the init time issue is a 'simple thing' (in so far as anything is ever really 'simple'), but the memory savings would be more involved. Any progress is progress; even if we think it works in a narrow case (say, the CAM-MPAS decompositions we use for the km-scale runs), we can continue running / testing it there, with a big impact, until longer-term testing is viable. We did this with an ESMF patch too, until it could be brought in officially.
Let me know when you want Erik and me to start working on the Init time issue.
We had some email discussion about this that I'm adding here:
I'll also add that during Aaron Donahue's visit we talked a bit about our initialization performance (because they have some issues too), and the land piece just came up in an email thread with him. Looking at their code, it seems they've addressed the performance issue - or, at least, they don't have the loop over ranks (clumps) inside the loop over clumps. Here's a copy of the land decompInitMod.F90 file from E3SM v3.0.2 on Derecho:
/lustre/desc1/scratch/bdobbins/e3sm_land/E3SM-3.0.2/components/elm/src/main/decompInitMod.F90
I haven't looked into it enough to see if they're effectively doing John's solution, but I thought I'd just share in case it gives other ideas. Any solution is a good one!
Cheers,
- Brian
On Tue, May 27, 2025 at 10:40 AM John Dennis <[email protected]> wrote:
> Erik,
>
> My calendar is up-to-date, so I would encourage you to find something in the next couple of days. I do have a bit of background reading for you on the prefix sum algorithm. Specifically, the first section of the Medium article describes a prefix sum algorithm. Fundamentally, the existing slow loop in CTSM performs a prefix scan on each MPI rank. Instead, we will be replacing the loop with an MPI_SCAN function and some form of gather.
>
> John
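To make the mapping concrete, here is a minimal sketch of the idea John describes (illustrative names only, not CTSM or E3SM code): the old approach has each rank recompute the prefix sum of everyone's cell counts, while MPI_Scan returns the same inclusive partial sum in one collective call.

```fortran
! Minimal sketch (illustrative names, not CTSM code): each rank's starting
! offset is a prefix sum of the cell counts of the lower ranks. Instead of
! every rank recomputing that sum in a loop over all ranks, MPI_Scan returns
! the inclusive partial sum directly.
program prefix_scan_sketch
  use mpi
  implicit none
  integer :: ierr, iam, npes, ncells_loc, inclusive_sum, begg, endg

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, iam, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

  ! Pretend each rank owns a deterministic number of cells.
  ncells_loc = 100 + iam

  ! Old pattern (commented out): requires every rank to know every other
  ! rank's count and costs O(npes) per rank.
  !   offset = 0
  !   do p = 0, iam - 1
  !      offset = offset + ncells_on_rank(p)
  !   end do

  ! New pattern: one collective call; subtracting the local count turns the
  ! inclusive sum into the exclusive offset.
  inclusive_sum = 0
  call MPI_Scan(ncells_loc, inclusive_sum, 1, MPI_INTEGER, MPI_SUM, &
                MPI_COMM_WORLD, ierr)
  endg = inclusive_sum
  begg = inclusive_sum - ncells_loc + 1

  print *, 'rank', iam, 'owns global cells', begg, 'through', endg
  call MPI_Finalize(ierr)
end program prefix_scan_sketch
```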
To preserve the E3SM version that Brian pointed to, I'm adding the file here:
There are three MPI_Scan calls in the file:
```fortran
!------------------------------------------------------------------------------
subroutine decompInit_lnd_using_gp(lni, lnj, cellsOnCell, ncells_loc, maxEdges, &
     amask)
!
! !DESCRIPTION:
! This subroutine initializes the land surface decomposition into a clump
! data structure using graph partitioning approach. This assumes each pe
! has the same number of clumps set by clump_pproc.
.
.
.
! Determine the cell id offset on each processor
cell_id_offset = 0
call MPI_Scan(ncells_loc, cell_id_offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
cell_id_offset = cell_id_offset - ncells_loc
! Determine the total number of grid cells
call MPI_Allreduce(ncells_loc, ncells_tot, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
.
.
.
!
! Compute information for each processor
!
procinfo%ncells = ncells_owned
offset = 0
call MPI_Scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
procinfo%begg = offset + 1 - ncells_owned
offset = 0
call MPI_scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
procinfo%endg = offset
```
There's a bunch of MCT glue around that. So our solution will need to work with ESMF rather than MCT, and might change how some of this works for us.
Looking the code over, here's what I see that needs to be done:
The core of this is changing the subroutine:
decompInit_lnd
so that instead of every task looping over all clumps globally, the work is done only over each task's own clumps. There will be MPI communication and MPI reduction operations (i.e., MPI_SCAN) needed in order to get this part to work.
Then subroutine:
decompInit_clumps
is changed so that it only allocates memory over the clumps for the given task. This part should be straightforward. It could also be done before the changes to decompInit_lnd, as each task should only use its own clump information and shouldn't need any information on other clumps.
This subroutine should also be refactored to have the printing of the decomposition in its own subroutine, done in such a way that each task prints its info and waits for the next one. We probably should add more control over when and how it's printed as well.
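One common way to do the ordered printing is sketched below (illustrative names, not the planned implementation):

```fortran
! Sketch of ordered per-task printing (illustrative names, not the planned
! implementation): each task writes its own decomposition info on its turn,
! with a barrier between turns so tasks don't all write at once. (Exact output
! ordering still depends on how the system flushes I/O from different ranks.)
subroutine write_decomp_ordered(mpicom, iam, npes, begg, endg, nclumps_loc)
  use mpi
  implicit none
  integer, intent(in) :: mpicom          ! MPI communicator
  integer, intent(in) :: iam, npes       ! this task's rank and the task count
  integer, intent(in) :: begg, endg      ! local beginning/ending gridcell indices
  integer, intent(in) :: nclumps_loc     ! clumps owned by this task
  integer :: p, ierr

  do p = 0, npes - 1
     if (p == iam) then
        write(*,*) 'task ', iam, ': begg =', begg, ' endg =', endg, &
                   ' clumps =', nclumps_loc
     end if
     call MPI_Barrier(mpicom, ierr)      ! let one task at a time take its turn
  end do
end subroutine write_decomp_ordered
```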
I think subroutine decompInit_glcp won't need to be changed. It already has MPI communication going on, and I also don't see any global memory usage in it, whereas the first two subroutines mentioned above do.
@johnmauff and I met last week. He took me through this document:
https://docs.google.com/document/d/1Q3QupotOTLtmf4KMsbu16dMccuNVHlfXp1JnEu1fkdw/
The goal stated above is:
GOAL: Memory usage and initialization time should be nearly constant
Right now both memory and initialization time increase rapidly with the number of tasks. In the end we should be able to get both pretty close to flat.
We figure we will make small incremental improvements that can probably come in on b4b-dev. There might also be a long-standing branch for any parts that can't come in that way, but we'll only do that if it's really needed. One thing I wonder about needing a branch for right now is some system tests that check the grids we want; but even that could likely be put on a different test list and still come to main-dev. The other thing is the memory evaluation code, which shouldn't come to main-dev as is, so there will likely need to be a branch for it so it can be easily used.
We figured for testing we probably should do these lower resolutions to start: ne3pg{2,3}, ne30pg3
And then move on to higher resolutions: ne120, mpas15. At the end we then verify that we have improvements in the ultra-high-resolution grids: mpas7p5, mpas3p75
I also talked about my idea of possibly adding unit testing, but for sure adding some self-test control, so that it checks the decomposition for validity and can also do things like shut down after the decomposition is done. So we'll plan on that and see how it pans out.
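As a rough sketch of what such a self-test might look like (the subroutine and any namelist flag gating it are hypothetical, not existing CTSM code): check that each task's index range matches its cell count and that the owned counts sum to the expected global total, with a namelist switch deciding whether to run the check and whether to shut down afterward.

```fortran
! Sketch of a possible decomposition self-test (hypothetical names -- neither
! the subroutine nor any namelist flag gating it exists in CTSM today).
subroutine check_decomp(mpicom, masterproc, ncells_owned, begg, endg, ncells_glob)
  use mpi
  implicit none
  integer, intent(in) :: mpicom            ! MPI communicator
  logical, intent(in) :: masterproc        ! true on the root task
  integer, intent(in) :: ncells_owned      ! cells owned by this task
  integer, intent(in) :: begg, endg        ! local global-index range
  integer, intent(in) :: ncells_glob       ! expected global number of cells
  integer :: total, ierr

  ! Local consistency: the index range must match the owned count.
  if (endg - begg + 1 /= ncells_owned) then
     write(*,*) 'decomp self-test FAILED: range/count mismatch'
     call MPI_Abort(mpicom, 1, ierr)
  end if

  ! Global consistency: the owned counts must sum to the global total.
  call MPI_Allreduce(ncells_owned, total, 1, MPI_INTEGER, MPI_SUM, mpicom, ierr)
  if (total /= ncells_glob) then
     if (masterproc) write(*,*) 'decomp self-test FAILED: total cells =', total
     call MPI_Abort(mpicom, 1, ierr)
  end if

  if (masterproc) write(*,*) 'decomp self-test passed'
end subroutine check_decomp
```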
@johnmauff also has some code that gets more detailed memory usage. Here's his notes on that:
I have placed the memory usage code that I have been using to identify memory leaks in CAM in my home directory on Derecho.
~dennis/MemoryLeak/share/proc_status_vm.F90
I place this file in the "share/src/" directory, which allows it to be accessible by all component models. I typically only call the subroutine from a single MPI rank. For example:
```fortran
if(masterproc) then
   call prt_vm_status('CTSM: decompInit_lnd: point #1')
endif
```
It will print out detailed memory usage that can then be plotted with a Python script.
Right now this is a simple one-off to help in cases like this on Derecho. But longer term we should examine whether the GPTL timers already have this capability, or whether it could be added to them if not. The approach is general for Linux-type machines (it relies on the information they store in files under the /proc directory), so it may be general enough to come in close to as-is.
In any case, since this has been an important tool for his work, we'll be using it to evaluate this work.
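For reference, the general /proc-based idea looks something like this (a sketch only, not John's proc_status_vm.F90; the subroutine name is made up):

```fortran
! Sketch of the general /proc approach (not John's proc_status_vm.F90): on
! Linux, per-process memory statistics live in /proc/self/status, so a single
! rank can read and report them at points of interest.
subroutine report_vm_status(tag)
  implicit none
  character(len=*), intent(in) :: tag
  character(len=256) :: line
  integer :: unit, ios

  open(newunit=unit, file='/proc/self/status', status='old', &
       action='read', iostat=ios)
  if (ios /= 0) return                        ! no Linux-style /proc available

  write(*,'(2a)') 'memory status at: ', trim(tag)
  do
     read(unit, '(a)', iostat=ios) line
     if (ios /= 0) exit
     ! VmRSS = current resident set size, VmHWM = peak resident set size
     if (line(1:5) == 'VmRSS' .or. line(1:5) == 'VmHWM') then
        write(*,'(2x,a)') trim(line)
     end if
  end do
  close(unit)
end subroutine report_vm_status
```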
From talking with @johnmauff today, we realized we don't need three MPI_Scans, as endg can just be set from the result of the second MPI_Scan.
So the last bits of the code above become:
```fortran
!
! Compute information for each processor
!
procinfo%ncells = ncells_owned
offset = 0
call MPI_Scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
procinfo%endg = offset
procinfo%begg = offset + 1 - ncells_owned
```
We also realized the graph partitioning done in the ELM code doesn't need to be done either.
The other thing of note from the meeting with @johnmauff is that there's some non-scalable memory outside of decompInit, and there are also some timing issues outside of decompInit in initialize2 (technically also in initialize1, but it's an order of magnitude smaller).
See:
https://docs.google.com/document/d/1Q3QupotOTLtmf4KMsbu16dMccuNVHlfXp1JnEu1fkdw/edit?tab=t.0#heading=h.n72s2q6lmmcc
@johnmauff and I went over the E3SM example code in our standup this morning to figure out what we need to do here. Currently every processor has a copy of the complete map of processors to grid cells, i.e., which cells every other MPI task manages. And every processor repeats the same calculation for the entire grid.
What we will be doing is going to a framework like this:
- Figure out how many gridcells each MPI task will handle -- on all tasks (avoiding an MPI broadcast)
- Each task preserves its own info on how many gridcells it will handle
- Use MPI_SCAN to figure out the offsets, which give the global begg and endg indices for the local task
At that point you can fill out all the information for the local task.
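A minimal sketch of that framework (illustrative names, and assuming a simple even split of cells just for the purpose of the example):

```fortran
! Sketch of the framework above (illustrative names): every task evaluates the
! same deterministic split for itself, so no broadcast of a global map is
! needed; one MPI_Scan then gives the local begg/endg, and only local values
! are kept.
subroutine decomp_local_only(mpicom, iam, npes, ncells_glob, begg, endg)
  use mpi
  implicit none
  integer, intent(in)  :: mpicom               ! MPI communicator
  integer, intent(in)  :: iam, npes            ! this task's rank and the task count
  integer, intent(in)  :: ncells_glob          ! total number of active gridcells
  integer, intent(out) :: begg, endg           ! this task's global index range
  integer :: ncells_loc, offset, ierr

  ! Step 1: deterministic split -- any task can compute its own share without
  ! seeing the full processor-to-gridcell map.
  ncells_loc = ncells_glob / npes
  if (iam < mod(ncells_glob, npes)) ncells_loc = ncells_loc + 1

  ! Step 2: inclusive prefix sum of the per-task counts.
  offset = 0
  call MPI_Scan(ncells_loc, offset, 1, MPI_INTEGER, MPI_SUM, mpicom, ierr)

  ! Step 3: keep only the local range; no size(npes) arrays are retained.
  endg = offset
  begg = offset - ncells_loc + 1
end subroutine decomp_local_only
```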
This means that ultimately we can have structures where each task only holds the information it needs locally about the beginning and ending points. But to do this in steps and make sure we don't mess things up, we will likely remove the global structures one by one. The E3SM example shows this as well: each task retains the global clump structure even though ultimately it won't need it.
I'm going to make some design notes on all this and get started on doing some of this, this week.
@ekluzek would it be possible to figure out how many grid cells each MPI task will handle on rank? If it is a deterministic algorithm, it could be done on all MPI ranks. That would eliminate the need for the masterproc to send this information to all other ranks.
Yes, that is what's being done now. I think what you are saying is that the overhead and communication cost of the broadcast is going to be much larger and more problematic than the small amount of work to do this on each task. That does make sense to me.
That could be done by having each task hold an array of size(npes), or by having each task figure it out but only keep the values for the local task. The latter is probably the way to do it -- at least for the end result.
#3469 takes care of this for mpasaep75 on 40K processors. I'm breaking that up into smaller segments that can come in and be reviewed more easily. But when that comes in, the bulk of this is completed. There are still some sub-issues that could be worked on and brought in later.
We met and went over some of this to both plan for the future and to have a minimum part to finish relatively soon.