
Scalability issue in decompInit_lnd

Open briandobbins opened this issue 11 months ago • 10 comments

In the decompInit_lnd routine (in src/main/decompInitMod.F90), there's a performance issue in the section below:

https://github.com/ESCOMP/CTSM/blob/7ff60613ddb8235396e0098b67bf0dfa79195e28/src/main/decompInitMod.F90#L226-L236

This happens because the main loop enclosing that section is over every cell, and this internal loop is over the number of clumps, which is at least equal to the number of PEs. For km-scale runs this ends up being quite large - e.g., on the 3.75 km test case, the outer loop is ~42M iterations and the inner loop is >~40K, since that's the minimum number of PEs we're able to run the case on. The conditional does restrict the inner loop to only ~12.4M of the 42M cells, but it's still a problem to have the cost scale with the number of cores.
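For illustration, the shape of the problem is roughly the following; the names here are placeholders, not the actual code at the link above:

    do n = 1, numg                       ! outer loop over all global cells (~42M at 3.75 km)
       if (amask(n) == 1) then           ! passes for ~12.4M of those cells
          do cid = 1, nclumps            ! inner loop over clumps, >= number of PEs (>~40K here)
             ! ... conditionals to find the clump/PE that owns this cell ...
          end do
       end if
    end do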

In terms of data, the loop above took between 754 and 1044 seconds, averaging 804 seconds across all PEs. That's ~86% of the InitializeRealize call.

I've got a few ideas on things to try, including simply changing the complex conditionals as a temporary work-around, as well as saving/reading a decomposition, but would welcome insights from folks who know the land model.

Perhaps the other issue here, which may be more important in the end, is that it seems like 'clumps' is allocated on every rank, with entries for every rank?

https://github.com/ESCOMP/CTSM/blob/7ff60613ddb8235396e0098b67bf0dfa79195e28/src/main/decompInitMod.F90#L118

Again, I welcome input here - this is a challenge for memory scalability, at least, and unless this is needed elsewhere, we should move to a local-to-the-rank structure.

Anyway, just getting the issue in - I think I can create work-arounds for the near-term needs, but would be happy to chat with any land folks on this and see if we can get a SIF or some other way to focus on addressing this soon, too.

Thanks!

Definition of done:

  • [ ] Communicate with John about what we need to do
  • [x] Erik learns about MPI_Scan from the article
  • [ ] Possibly meet with John if we need to figure things out
  • [ ] Start adding self-tests and self-test namelist control to test decomposition
  • [ ] Add a unit test for decompInit?
  • [ ] Can I add unit tests that run in parallel?
  • [ ] Should we also have the ability to save the decomposition?
  • [ ] Implement the ability to save the decomposition if desired
  • [ ] Start implementing the solution on a branch
  • [ ] Figure out appropriate testing to do for it (is there a way to do this in a unit test?)
  • [ ] Do the needed testing
  • [ ] Make sure the MPI_SCAN works in mpi-serial
  • [ ] Upgrade to a PR and bring to CTSM master

briandobbins avatar Mar 09 '25 22:03 briandobbins

Thanks for creating this issue, @briandobbins. Let us know the timeline that's helpful for this to be addressed. As you know, we're kind of slammed with prepping for CLM6 / CESM3, so addressing this after the release will be more realistic. That said, we don't want poor scalability hindering the high-res work you're trying to do.

wwieder avatar Mar 11 '25 20:03 wwieder

Great sleuthing, @briandobbins! See also the doubly nested loop over clumps in decompInit_clumps, which should probably be addressed similarly (though it's not as bad as the ngridcells*nclumps loop).

billsacks avatar Mar 14 '25 21:03 billsacks

@briandobbins do you need any of us to do anything with this now?

ekluzek avatar Mar 20 '25 16:03 ekluzek

Not yet - this isn't critical for CESM3; I'll likely try to target this under some of the km-scale efforts in the near future, but wanted to make the issue so I don't forget it. Thanks!

briandobbins avatar Mar 20 '25 17:03 briandobbins

@briandobbins, I think @ekluzek will have time to start on the simple things here over the next month. Maybe you can help coordinate between John Dennis (please add his GitHub handle to this thread) and Erik to determine what this may look like. I'm less sure that Erik will be able to devote much time to the testing of this first phase over Sprint 18 (basically May).

@ekluzek can you create some child issues for these steps in improving our initialization?

wwieder avatar Apr 29 '25 21:04 wwieder

Sure, sounds great -- I think from prior conversations with John, the init time issue is a 'simple thing' (in so far as anything is ever really 'simple'), but the memory savings would be more involved. Any progress is progress; even if we think it works in a narrow case (say, the CAM-MPAS decompositions we use for the km-scale runs), we can continue running / testing it there, with a big impact, until longer-term testing is viable. We did this with an ESMF patch too, until it could be brought in officially.

briandobbins avatar Apr 29 '25 22:04 briandobbins

Let me know when you want Erik and me to start working on the init time issue.

johnmauff avatar Apr 29 '25 22:04 johnmauff

We had some email discussion about this that I'm adding here:

I'll add to this too that during Aaron Donahue's visit we talked a bit about our initialization performance (because they have some issues too), and the land issue just came up in an email thread with him. Looking at their code, it seems they've addressed the performance issue - or, at least, they don't have the loop over ranks (clumps) within the loop over clumps. Here's a copy of the land decompInitMod.F90 file from E3SM v3.0.2 on Derecho:

/lustre/desc1/scratch/bdobbins/e3sm_land/E3SM-3.0.2/components/elm/src/main/decompInitMod.F90

I haven't looked into it enough to see if they're effectively doing John's solution, but I thought I'd just share in case it gives other ideas. Any solution is a good one!

Cheers,

- Brian

On Tue, May 27, 2025 at 10:40 AM John Dennis <[email protected]> wrote:

Erik,

My calendar is up-to-date, so I would encourage you to find something in the next couple of days. I do have a bit of background reading for you on the prefix sum algorithm. Specifically, the first section of the Medium article describes a prefix sum algorithm. Fundamentally, the existing slow loop in CTSM performs a prefix scan on each MPI rank. Instead, we will be replacing the loop with an MPI_SCAN function and some form of gather.

John

ekluzek avatar Jun 12 '25 22:06 ekluzek

To preserve the E3SM version that Brian had pointed to I'm adding the file here:

decompInitMod.F90.txt

There are three MPI_Scan calls in the file:

  !------------------------------------------------------------------------------
  subroutine decompInit_lnd_using_gp(lni, lnj, cellsOnCell, ncells_loc, maxEdges, &
       amask)
    !
    ! !DESCRIPTION:
    ! This subroutine initializes the land surface decomposition into a clump
    ! data structure using graph partitioning approach.  This assumes each pe
    ! has the same number of clumps set by clump_pproc.
.
.
.
    ! Determine the cell id offset on each processor
    cell_id_offset = 0
    call MPI_Scan(ncells_loc, cell_id_offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
    cell_id_offset = cell_id_offset - ncells_loc
  
    ! Determine the total number of grid cells
    call MPI_Allreduce(ncells_loc, ncells_tot, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
.
.
.
    !
    ! Compute information for each processor
    !
    procinfo%ncells = ncells_owned

    offset = 0
    call MPI_Scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
    procinfo%begg = offset + 1 - ncells_owned

    offset = 0
    call MPI_scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
    procinfo%endg = offset

There's a bunch of MCT glue around that. So our solution will need to work with ESMF rather than MCT, and might change how some of this works for us.

ekluzek avatar Jun 12 '25 22:06 ekluzek

Looking the code over, here's what I see needs to be done:

The core of this is changing the subroutine:

decompInit_lnd

so that instead of every task looping over all clumps globally, the work is done per task. There will be MPI communication and MPI reduction operations (i.e., MPI_Scan) in order to get this part to work.
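A minimal sketch of that per-task pattern, following the E3SM code quoted above (mpicom, npes, and procinfo are used as in CTSM; ncells_owned and ncells_per_task are placeholder names):

    integer :: offset, ierr
    integer, allocatable :: ncells_per_task(:)

    ! Inclusive prefix sum: after the call, offset on rank r is the sum of
    ! ncells_owned over ranks 0..r, so begg/endg fall out without a global loop.
    offset = 0
    call MPI_Scan(ncells_owned, offset, 1, MPI_INTEGER, MPI_SUM, mpicom, ierr)
    procinfo%endg = offset
    procinfo%begg = offset + 1 - ncells_owned

    ! If any remaining code still needs every task's count, gather it once
    ! rather than recomputing the whole decomposition on every task.
    allocate(ncells_per_task(0:npes-1))
    call MPI_Allgather(ncells_owned, 1, MPI_INTEGER, ncells_per_task, 1, &
                       MPI_INTEGER, mpicom, ierr)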

Then subroutine:

decompInit_clumps

is changed so that it only allocates memory for the clumps on the given task. This part should be straightforward. It also could be done before the changes to decompInit_lnd, as each task should only use its own clump information and shouldn't need any information on other clumps.
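As a rough sketch of the allocation change (the details are illustrative; the real code has to stay consistent with whatever clump-to-task assignment decompInit_lnd makes):

    ! Allocate clump info only for this task's clumps, not for every clump
    ! on every task.  clump_pproc is the existing clumps-per-PE setting.
    allocate(clumps(clump_pproc))     ! was allocated over all nclumps = npes*clump_pproc
    ! Any place that indexed clumps by a global clump id then needs a
    ! global-to-local index translation consistent with the owner assignment.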

This subroutine should also be refactored to have the printing of the decomposition in its own subroutine, and that part should be done in such a way that each task prints its info and waits for the next one. We probably should add more control over when and how it's printed as well.
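One common way to get that print-and-wait behavior is a rank-ordered loop with a barrier; a sketch (iulog, iam, npes, and mpicom as used elsewhere in CTSM; the print_decomp_info flag is a hypothetical namelist control):

    if (print_decomp_info) then
       do p = 0, npes-1
          if (iam == p) then
             ! each task takes its turn writing its own decomposition info
             write(iulog,*) 'task ', iam, ': begg = ', procinfo%begg, ' endg = ', procinfo%endg
          end if
          call mpi_barrier(mpicom, ierr)   ! keep the output roughly in rank order
       end do
    end if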

I think subroutine decompInit_glcp won't need to be changed. It already has MPI communication going on, and as such I don't think it needs to be touched. I also don't see any global memory usage in it, whereas the first two subroutines mentioned above do.

ekluzek avatar Jun 18 '25 14:06 ekluzek

@johnmauff and I met last week. He took me through this document:

https://docs.google.com/document/d/1Q3QupotOTLtmf4KMsbu16dMccuNVHlfXp1JnEu1fkdw/

The goal stated above is:

GOAL: Memory usage and initialization time should be nearly constant

Right now both memory and initialization time increase rapidly with the number of tasks. In the end we should be able to get both pretty close to flat.

We figure we will make small incremental improvements that can probably come in on b4b-dev. There might also be a long-standing branch for any parts that can't come in, but we'll only do that if it's really needed. One thing I wonder about keeping on a branch for now is some system tests that check the grids we want; but even that could likely be put on a different test list and still come to main-dev. The other thing is the memory evaluation code, which shouldn't come to main-dev as-is, so there likely will need to be a branch for it so it can be easily used.

We figured for testing we should probably start with these lower resolutions: ne3pg{2,3}, ne30pg3.

Then move on to higher resolutions: ne120, mpas15. At the end we verify that we have improvements in the ultra-high-resolution grids: mpas7p5, mpas3p75.

I also talked about my idea of possibly adding unit testing, but for sure adding some self-test control, so that it checks the decomposition for validity as well as doing things like shutting down after the decomposition is done. So we'll plan on that and see how it pans out.
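A sketch of the kind of check such a self-test control could run (the for_testing_check_decomp flag and ncells_check variable are hypothetical; numg, procinfo, and mpicom are the existing CTSM names):

    if (for_testing_check_decomp) then
       ! The local range must be well formed and match the local cell count.
       if (procinfo%endg - procinfo%begg + 1 /= procinfo%ncells) then
          call shr_sys_abort('decomp self-test: begg/endg inconsistent with ncells')
       end if
       ! The per-task counts must add up to the global number of gridcells.
       call MPI_Allreduce(procinfo%ncells, ncells_check, 1, MPI_INTEGER, MPI_SUM, mpicom, ierr)
       if (ncells_check /= numg) then
          call shr_sys_abort('decomp self-test: per-task cell counts do not sum to numg')
       end if
    end if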

ekluzek avatar Jun 23 '25 19:06 ekluzek

@johnmauff also has some code that gets more detailed memory usage. Here are his notes on that:

  I have placed the memory usage code that I have been using to identify memory leaks in CAM in my home directory on Derecho:

~dennis/MemoryLeak/share/proc_status_vm.F90

I place this file in the "share/src/" directory, which allows it to be accessible by all component models.  I typically only call the subroutine from a single MPI rank.  For example:

   if(masterproc) then 
      call prt_vm_status('CTSM: decompInit_lnd: point #1')
   endif

  It will print out detailed memory usage that can then be plotted using a Python script.

Right now this is a simple one-off to help in cases like this on Derecho. But longer term we should examine whether the GPTL timers already have this capability, or whether it could be added to them if not. The approach is general for Linux-type machines (it relies on how they expose this information in files under the /proc directory), so it may be general enough to come in close to as-is.
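This is not John's proc_status_vm.F90, but a minimal sketch of the same /proc-based approach, to show why it is Linux-specific:

    subroutine print_vm_status(tag)
       ! Minimal sketch: scan /proc/self/status (Linux only) for the resident
       ! set size and high-water-mark lines and echo them with a caller tag.
       character(len=*), intent(in) :: tag
       character(len=256) :: line
       integer :: unit, ios
       open(newunit=unit, file='/proc/self/status', status='old', action='read', iostat=ios)
       if (ios /= 0) return
       do
          read(unit, '(a)', iostat=ios) line
          if (ios /= 0) exit
          if (index(line, 'VmRSS') == 1 .or. index(line, 'VmHWM') == 1) then
             write(*,'(a,": ",a)') trim(tag), trim(line)
          end if
       end do
       close(unit)
    end subroutine print_vm_status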

In any case, since this has been an important tool for his work, this is something we'll be using for evaluation of this work.

ekluzek avatar Jun 23 '25 19:06 ekluzek

From talking with @johnmauff today, we realized we don't need three MPI_Scan calls, as endg can just be set from the second one.

So the last bits of the code above become:

    !
    ! Compute information for each processor
    !
    procinfo%ncells = ncells_owned

    offset = 0
    call MPI_Scan(ncells_owned, offset, 1, MPIU_INTEGER, MPI_SUM, mpicom, ierr)
    procinfo%endg = offset
    procinfo%begg = offset + 1 - ncells_owned

We also realized the graph partitioning done in the ELM code doesn't need to be done either.

ekluzek avatar Jul 15 '25 16:07 ekluzek

The other thing of note from the meeting with @johnmauff is that there's some non-scalable memory outside of decompInit, and there are some timing issues outside of decompInit in initialize2 (technically also in initialize1, but it's an order of magnitude smaller).

See:

https://docs.google.com/document/d/1Q3QupotOTLtmf4KMsbu16dMccuNVHlfXp1JnEu1fkdw/edit?tab=t.0#heading=h.n72s2q6lmmcc

ekluzek avatar Jul 15 '25 18:07 ekluzek

@johnmauff and I went over the E3SM example code in our standup this morning to figure out what we need to do here. Currently every processor has a copy of the complete processor-to-gridcell map, i.e., which cells every other MPI task manages, and every processor repeats the same calculation for the entire grid.

What we will be doing is going to a framework like this:

  1. Figure out how many gridcells each MPI task will handle -- on all tasks (avoiding an MPI broadcast)
  2. Each task preserves its info on how many gridcells it will handle
  3. Use MPI_Scan to figure out the offsets, which give the global indices begg and endg for the local task

At that point you can fill out all the information for the local task.

This means that ultimately we can have structures where each task only holds the information it needs locally about its beginning and ending points. But to do this in steps, and make sure we don't mess things up, we will likely remove global structures one by one. The E3SM example shows this as well: each task retains the global clump structure even though it ultimately won't need it.

I'm going to make some design notes on all this and get started on doing some of this, this week.

ekluzek avatar Aug 26 '25 17:08 ekluzek

@ekluzek would it be possible to figure out how many grid cells each MPI task will handle on each rank? If it is a deterministic algorithm, it could be done on all MPI ranks. That would eliminate the need for the masterproc to send this information to all other ranks.

johnmauff avatar Aug 26 '25 17:08 johnmauff

@ekluzek would it be possible to figure out how many grid cells each MPI task will handle on each rank? If it is a deterministic algorithm, it could be done on all MPI ranks. That would eliminate the need for the masterproc to send this information to all other ranks.

Yes, that is what's being done now. I think what you are saying is that the overhead and communication cost of the broadcast is going to be much larger and more problematic than the small amount of work needed to do this on each task. That does make sense to me.

That could be done by having each task keep an array of size(npes), or by having each task figure out the count for every task but only keep the values for the local task. The latter is probably the way to do it -- at least for the end result.
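For example, if the rule were a simple block distribution, it would be a closed-form function of the rank, so every task could evaluate it for itself (or for any other rank) with no communication at all. The actual CTSM assignment is more involved, but the principle is the same; names here are illustrative:

    integer function ncells_for_task(p, numg, npes) result(n)
       ! Deterministic rule: the first mod(numg, npes) tasks get one extra cell.
       integer, intent(in) :: p, numg, npes
       n = numg / npes
       if (p < mod(numg, npes)) n = n + 1
    end function ncells_for_task

    ! Each rank keeps only its own value -- no broadcast, no size(npes) array:
    ncells_owned = ncells_for_task(iam, numg, npes)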

ekluzek avatar Aug 26 '25 18:08 ekluzek

#3469 takes care of this for mpasaep75 on 40k processors. I'm breaking that up into smaller segments that can come in and be reviewed more easily. But when that comes in, the bulk of this will be completed. There are still some sub-issues that could be worked on and brought in later.

ekluzek avatar Sep 26 '25 17:09 ekluzek

We met and went over some of this to both plan for the future and to have a minimum part to finish relatively soon.

ekluzek avatar Oct 21 '25 18:10 ekluzek