RRTMGP not working with GPUs on derecho
What happened?
@sjsprecious reported that CAM w/ RRTMGP is not running with GPUs on derecho. He is also not able to run the standalone RRTMGP tests with tag v1.7 of the rte-rrtmgp external. He has contacted Robert Pincus about this problem.
What are the steps to reproduce the bug?
From Jian:
I used the nvhpc/23.7 compiler with cuda/12.2.1. I downloaded the input data from git@github.com:earth-system-radiation/rrtmgp-data.git. I set "RTE_KERNELS=accel" to enable the GPU code. My compiler flags are "-g -Minfo -Mchkptr -Mstandard -Kieee -Mchkstk -Mallocatable=03 -Mpreprocess -acc -gpu=cc80,lineinfo -Minfo=accel".
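For anyone trying to reproduce this, the settings quoted above can be collected into a small shell sketch. Only `RTE_KERNELS=accel` and the flag string come from the report; passing the compiler and flags through `FC` and `FCFLAGS` is an assumption about how the standalone build picks them up, and the `module load` line is an assumed way to select the reported versions, so it is left commented out.

```shell
# Sketch of the reported build environment (assumptions are marked).

# Assumed module selection on the machine; not executed here.
#   module load nvhpc/23.7 cuda/12.2.1

# From the report: enable the GPU (OpenACC) kernels.
export RTE_KERNELS=accel

# Assumption: the build reads the compiler and flags from FC / FCFLAGS.
export FC=nvfortran
export FCFLAGS="-g -Minfo -Mchkptr -Mstandard -Kieee -Mchkstk -Mallocatable=03 -Mpreprocess -acc -gpu=cc80,lineinfo -Minfo=accel"

echo "kernels=${RTE_KERNELS} compiler=${FC}"
```

Dropping `-acc -gpu=cc80,lineinfo -Minfo=accel` and unsetting `RTE_KERNELS` should give the matching CPU configuration for comparison, assuming the build treats an unset `RTE_KERNELS` as the default kernels.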
What CAM tag were you using?
cam6_3_148 and later
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
NVHPC
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
No
Extra info
No response
Thanks Brian for opening this issue. To provide additional information:
- The failure of running the RRTMGP GPU code in CAM is caused by this change (https://github.com/ESCOMP/CAM/blob/cam6_3_148/src/physics/rrtmgp/radiation.F90#L964). In particular, the problem seems to come from changing the type of `fswc` from `ty_fluxes_byband` to `ty_fluxes_broadband`. Switching back to type `ty_fluxes_byband` for `fswc` works for the GPU code. However, based on the discussion with Brian, this does not make sense, as type `ty_fluxes_byband` just allocates a few more arrays than type `ty_fluxes_broadband`, and those arrays are not used by the `fswc` calculation anyway.
- According to Robert, the ICON model used the `ty_fluxes_broadband` type for `fswc` and it worked fine (https://gitlab.dkrz.de/icon/icon-model/-/blob/release-2024.01-public/src/atm_phy_rte_rrtmgp/mo_rte_rrtmgp_interface.f90?ref_type=heads#L643).
- I also tried the newer NVIDIA compiler `nvhpc/24.1` for the RTE-RRTMGP standalone code, but it still failed for both CPU & GPU runs. However, the same nvhpc compiler worked fine in the CI workflow for RTE-RRTMGP (CPU: https://github.com/earth-system-radiation/rte-rrtmgp/actions/runs/7962992343/job/21737673158; GPU: https://github.com/earth-system-radiation/rte-rrtmgp/actions/runs/7962992343/job/21737673722).
Current status:
- I am working with a CSG staff member to install a different version of netCDF on Derecho and see if it resolves the error in the RTE-RRTMGP standalone run (Derecho uses netcdf/4.9.2 while the CI workflow uses netcdf/4.5.4).
- Supreeth from ASAP/CISL is helping debug the CAM+RRTMGP GPU run and trying to find a solution for the `error 700: Illegal address during kernel execution` error.