RRTMGP not working with GPUs on derecho

Open brian-eaton opened this issue 1 year ago • 1 comments

What happened?

@sjsprecious reported that CAM w/ RRTMGP is not running with GPUs on derecho. He is also not able to run the standalone RRTMGP tests with tag v1.7 of the rte-rrtmgp external. He has contacted Robert Pincus about this problem.

What are the steps to reproduce the bug?

From Jian:

I used the nvhpc/23.7 compiler with cuda/12.2.1. I downloaded the input data from git@github.com:earth-system-radiation/rrtmgp-data.git. I set "RTE_KERNELS=accel" to enable the GPU code. My compiler flags are "-g -Minfo -Mchkptr -Mstandard -Kieee -Mchkstk -Mallocatable=03 -Mpreprocess -acc -gpu=cc80,lineinfo -Minfo=accel".
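For anyone trying to reproduce this, the settings above can be collected into a small shell sketch. This is not a verified recipe. The module names, the use of `FC`/`FCFLAGS` as the variables the build reads, and the GitHub host in the clone address are assumptions. Only the `RTE_KERNELS` value and the flag string are taken from the report.

```shell
# Sketch of the environment described in the report (not a verified recipe).

# Assumption: the compiler and CUDA are available as environment modules with these names.
if command -v module >/dev/null 2>&1; then
  module load nvhpc/23.7 cuda/12.2.1 || echo "module load failed; check module names"
fi

# Input data repository (uncomment to fetch; needs SSH access).
# Assumption: the host is GitHub.
# git clone git@github.com:earth-system-radiation/rrtmgp-data.git

# Select the GPU kernels, as stated in the report.
export RTE_KERNELS=accel

# Compiler flags copied verbatim from the report.
# Assumption: the build picks them up from FC and FCFLAGS.
export FC=nvfortran
export FCFLAGS="-g -Minfo -Mchkptr -Mstandard -Kieee -Mchkstk -Mallocatable=03 -Mpreprocess -acc -gpu=cc80,lineinfo -Minfo=accel"

echo "RTE_KERNELS=${RTE_KERNELS}"
echo "FCFLAGS=${FCFLAGS}"
```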

What CAM tag were you using?

cam6_3_148 and later

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

NVHPC

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

No

Extra info

No response

brian-eaton, Mar 18 '24 17:03

Thanks Brian for opening this issue. To provide additional information:

  • The RRTMGP GPU failure in CAM is caused by this change (https://github.com/ESCOMP/CAM/blob/cam6_3_148/src/physics/rrtmgp/radiation.F90#L964). In particular, the problem seems to come from changing the type of fswc from ty_fluxes_byband to ty_fluxes_broadband. Switching fswc back to type ty_fluxes_byband makes the GPU code work. However, based on the discussion with Brian, this does not make sense: type ty_fluxes_byband just allocates a few more arrays than type ty_fluxes_broadband, and those arrays are not used in the fswc calculation anyway.
  • According to Robert, the ICON model used the ty_fluxes_broadband type for fswc and it worked fine (https://gitlab.dkrz.de/icon/icon-model/-/blob/release-2024.01-public/src/atm_phy_rte_rrtmgp/mo_rte_rrtmgp_interface.f90?ref_type=heads#L643).
  • I also tried a newer NVIDIA compiler, nvhpc/24.1, for the RTE-RRTMGP standalone code, but it still failed for both CPU and GPU runs. However, the same nvhpc compiler worked fine in the CI workflow for RTE-RRTMGP (CPU: https://github.com/earth-system-radiation/rte-rrtmgp/actions/runs/7962992343/job/21737673158; GPU: https://github.com/earth-system-radiation/rte-rrtmgp/actions/runs/7962992343/job/21737673722).

Current status:

  • I am working with a CSG staff member to install a different version of netCDF on Derecho to see whether it resolves the error in the RTE-RRTMGP standalone run (Derecho uses netcdf/4.9.2 while the CI workflow uses netcdf/4.5.4).
  • Supreeth from ASAP/CISL is helping debug the CAM+RRTMGP GPU run and trying to find a solution for the "error 700: Illegal address during kernel execution" failure.
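
To compare the netCDF versions between two environments, a quick check along these lines may help. It is a minimal sketch that assumes the `nc-config` and `nf-config` helper scripts (shipped with netCDF-C and netCDF-Fortran) are on `PATH` in each environment. The function name `report_netcdf_versions` is made up for illustration.

```shell
# Print the netCDF C and Fortran library versions visible in the current environment.
# nc-config / nf-config may be missing if no netCDF module is loaded.
report_netcdf_versions() {
  for tool in nc-config nf-config; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: $("$tool" --version)"
    else
      echo "$tool: not found"
    fi
  done
}

report_netcdf_versions
```

Running it once on Derecho and once in the CI container would show whether the two builds really link against different netCDF releases.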

sjsprecious, Mar 18 '24 18:03