New modules for Bridges2 GPU

Open sbryngelson opened this issue 1 year ago • 2 comments

It seems the previous modules for Bridges2 are no longer working. I found these to work:

  1) openmpi/4.0.5-nvhpc22.9   2) python/3.8.6   3) cuda/11.7.1

along with these exports:

export FC=nvfortran
export CC=nvc
export CXX=nvc++

sbryngelson avatar Sep 07 '24 16:09 sbryngelson

Update: related to #613. I found that the above works after changing set(NVHPC_USE_TWO_PASS_IPO TRUE) to set(NVHPC_USE_TWO_PASS_IPO FALSE) in CMakeLists.txt, but it does not work when two-pass IPO is TRUE. This confirms that at least nvhpc 23.X is required for two-pass IPO.
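
The switch described above can also be flipped from the command line. This is a minimal sketch, assuming the line appears in CMakeLists.txt exactly as quoted; the helper name disable_two_pass_ipo is hypothetical and not part of MFC.

```shell
# Hypothetical helper (not part of MFC): turn off two-pass IPO in a CMake file.
# $1 = path to the CMake file, e.g. CMakeLists.txt
disable_two_pass_ipo() {
    ipo_tmp="$(mktemp)" || return 1
    # Rewrite only the exact line quoted in the thread; other lines pass through.
    sed 's/set(NVHPC_USE_TWO_PASS_IPO TRUE)/set(NVHPC_USE_TWO_PASS_IPO FALSE)/' "$1" > "$ipo_tmp" \
        || { rm -f "$ipo_tmp"; return 1; }
    # Copy back instead of mv so the original file keeps its permissions.
    cat "$ipo_tmp" > "$1"
    rm -f "$ipo_tmp"
}
```

From MFC's root folder this would be called as disable_two_pass_ipo CMakeLists.txt before rebuilding.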

sbryngelson avatar Sep 07 '24 16:09 sbryngelson

Bridges2 does not have nvhpc 23.X, so I cannot test with it.

sbryngelson avatar Sep 07 '24 16:09 sbryngelson

Modules currently used by mfc.sh load:

b     PSC Bridges2
b-all python/3.8.6
b-cpu allocations/1.0 gcc/10.2.0 openmpi/4.0.5-gcc10.2.0
b-gpu openmpi/4.0.5-nvhpc22.9 nvhpc/22.9 cuda
b-gpu CC=nvc CXX=nvc++ FC=nvfortran

Newest modules on Bridges2:

b     PSC Bridges2
b-all python/3.8.6
b-cpu allocations/1.0 gcc/10.2.0 openmpi/4.0.5-gcc10.2.0
b-gpu openmpi/4.0.5-nvhpc22.9 nvhpc/22.9 cuda/12.4.0
b-gpu CC=nvc CXX=nvc++ FC=nvfortran
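
To see what a given system and mode would pick up, the entries can be pulled out of a listing like the ones above. This is a minimal sketch, assuming the format shown (a system id row, then id-all, id-cpu, and id-gpu rows). The helpers list_entries, list_modules, and list_env are hypothetical and are not taken from mfc.sh.

```shell
# Hypothetical helper (not taken from mfc.sh): print every entry that applies
# to one system and mode, one per line.
# $1 = file with the listing, $2 = system id (e.g. b), $3 = mode (cpu or gpu)
list_entries() {
    awk -v id="$2" -v mode="$3" '
        $1 == (id "-all") || $1 == (id "-" mode) {
            for (i = 2; i <= NF; i++) print $i
        }' "$1"
}

# Entries containing "=" are environment assignments; the rest are modules.
list_modules() { list_entries "$@" | grep -v '='; }
list_env()     { list_entries "$@" | grep '='; }
```

With the newest listing saved to a file, list_modules on system b in gpu mode prints python/3.8.6, openmpi/4.0.5-nvhpc22.9, nvhpc/22.9, and cuda/12.4.0, while list_env prints the three compiler assignments.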

Quick note: if an outdated Python version error ever occurs on Bridges2, a quick fix is to load anaconda3. That has always worked for me.

[almahrou@bridges2-login012 MFC]$ ./mfc.sh build
mfc: ERROR > Python 3.8.6 (python3) is out of date. Required >= 3.9.
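
The error above is a minimum-version check failing. Below is a minimal sketch of such a check, using a hypothetical helper version_at_least (not MFC's own code), which can help decide whether a different Python needs to be loaded before building.

```shell
# Hypothetical helper (not MFC's code): succeed if a version string such as
# "3.8.6" is at least MAJOR.MINOR.
# $1 = version string with at least major.minor, $2 = major, $3 = minor
version_at_least() {
    ver_major="${1%%.*}"        # text before the first dot
    ver_rest="${1#*.}"          # text after the first dot
    ver_minor="${ver_rest%%.*}" # text up to the next dot
    [ "$ver_major" -gt "$2" ] ||
        { [ "$ver_major" -eq "$2" ] && [ "$ver_minor" -ge "$3" ]; }
}

# Example: report whether the python3 on PATH satisfies the 3.9 minimum.
if command -v python3 > /dev/null 2>&1; then
    py_ver="$(python3 -c 'import platform; print(platform.python_version())')"
    if version_at_least "$py_ver" 3 9; then
        echo "python3 $py_ver is new enough"
    else
        echo "python3 $py_ver is too old; load a newer Python first"
    fi
fi
```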

Malmahrouqi3 avatar Jun 24 '25 04:06 Malmahrouqi3

So you can only build if you load anaconda? If so, we should add it to the Bridges2 modules.

sbryngelson avatar Jun 24 '25 05:06 sbryngelson

Kind of. I am still tackling an issue with environment variables. Below is a version of mfc.sh that is supposedly working; I am still figuring things out right now.

#!/bin/bash
# Custom MFC build script that uses Python 3.11

# Check whether this script was called from MFC's root directory.
if [ ! -f "$(pwd)/toolchain/util.sh" ]; then
    echo "build_mfc.sh: ERROR > You must call this script from within MFC's root folder."
    exit 1
fi

# Load utility script
. "$(pwd)/toolchain/util.sh"

# Clean and recreate the build directory
rm -rf "$(pwd)/build"
mkdir -p "$(pwd)/build"

# Create a temporary directory with the correct Python version
mkdir -p "$(pwd)/build/temp_bin"
ln -sf /usr/bin/python3.11 "$(pwd)/build/temp_bin/python3"

# Update the PATH to use our Python 3.11 and MPI
export PATH="$(pwd)/build/temp_bin:/jet/packages/nvidia/hpc_sdk/Linux_x86_64/22.9/comm_libs/openmpi4/openmpi-4.0.5/bin:$PATH"

# Set MPI environment variables for CMake
export MPI_HOME="/jet/packages/nvidia/hpc_sdk/Linux_x86_64/22.9/comm_libs/openmpi4/openmpi-4.0.5"
export MPI_ROOT="$MPI_HOME"
export MPI_Fortran_COMPILER="$MPI_HOME/bin/mpifort"
export MPI_C_COMPILER="$MPI_HOME/bin/mpicc"
export MPI_CXX_COMPILER="$MPI_HOME/bin/mpicxx"

# Set library and include paths
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"
export CPATH="$MPI_HOME/include:$CPATH"

# Additional CMake hints for MPI detection
export CMAKE_PREFIX_PATH="$MPI_HOME:$CMAKE_PREFIX_PATH"

# Set GPU compute capability explicitly (avoid compute_35 error)
# Compute capability 7.0 is the Volta generation (e.g. V100)
export MFC_CUDA_CC="70"

# Alternative CUDA configuration
export NVHPC_CUDA_HOME="/opt/packages/nvidia/hpc_sdk/Linux_x86_64/22.9/cuda"

# Force the compiler to use a specific CUDA version and architecture
export CUDA_HOME="/opt/packages/nvidia/hpc_sdk/Linux_x86_64/22.9/cuda"

# Additional NVIDIA compiler flags to avoid compute_35
export NVCC_ARGS="-arch=sm_70"
export GPU_ARCH="70"

# Source the CMake and Python bootstrap scripts
. "$(pwd)/toolchain/bootstrap/cmake.sh"
. "$(pwd)/toolchain/bootstrap/python.sh"

echo

# Run the main.py bootstrap script with the desired arguments
python3 "$(pwd)/toolchain/main.py" "$@"
code=$?

echo

if [ $code -ne 0 ]; then
    error "main.py finished with a $code exit code."
fi

# Deactivate the Python virtualenv in case the user "source"'d this script
if command -v deactivate > /dev/null 2>&1; then
    log "(venv) Exiting the$MAGENTA Python$COLOR_RESET virtual environment."
    deactivate
fi

# Clean up temporary files
rm -rf "$(pwd)/build/temp_bin"

# Exit with proper exit code
exit $code

Malmahrouqi3 avatar Jun 24 '25 05:06 Malmahrouqi3

Also worth mentioning: I have seen cases where conda and Python virtual environments clashed with each other, so one of them had to go. I am not 100% sure whether this will be an issue with MFC.

Malmahrouqi3 avatar Jun 24 '25 05:06 Malmahrouqi3

@Malmahrouqi3 something seems strange. What's the purpose of the script above? If you can find Python 3.11 in /usr/bin on a system, you can put it first in the path via export PATH=/usr/bin:$PATH and call it a day. If it's for a specific computer, you could even do this reassignment in the modules file.
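
The PATH reordering suggested above can be sketched as follows. prepend_path is a hypothetical helper, and whether /usr/bin on Bridges2 provides a python3 that resolves to 3.11 is an assumption the thread does not confirm, so the sketch prints which interpreter actually wins the lookup.

```shell
# Hypothetical helper: put a directory first on PATH unless it already is.
# $1 = directory to prepend
prepend_path() {
    case "$PATH" in
        "$1":*|"$1") ;;          # already the first entry: nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path /usr/bin

# Show which python3 is now found first, and its version, if there is one.
if command -v python3 > /dev/null 2>&1; then
    command -v python3
    python3 --version
else
    echo "python3 not found on PATH"
fi
```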

sbryngelson avatar Jun 24 '25 12:06 sbryngelson

Closed by @Malmahrouqi3 via Updated Bridges2 Modules (CPU/GPU) #905

sbryngelson avatar Jul 02 '25 14:07 sbryngelson