awsome-distributed-training

17.SM-modelparallelv2 conda script doesn't work

Open junpuf opened this issue 1 year ago • 1 comments

The setup_conda_env.sh script for the 17.SM-modelparallelv2 test case contains two problems, detailed below:

  1. MAX_JOBS=64 is hardcoded, which causes smaller EC2 instances to become unresponsive during the build. I recommend changing it to MAX_JOBS=$(nproc). The current line is:
MAX_JOBS=64 pip install flash-attn==2.3.3 --no-build-isolation
  2. cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz and cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz are never downloaded in the preceding steps, yet the script tries to extract them:
if [ $SMP_CUDA_VER == "11.8" ]; then
    # cuDNN installation for TransformerEngine installation for cuda11.8
    tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
        && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
        && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
        && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
        && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
        && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/
else
    # cuDNN installation for TransformerEngine installation for cuda12.1
    tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
        && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
        && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
        && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
        && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
        && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
fi
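
A minimal sketch of how both points might be addressed. The helper names (`pick_max_jobs`, `cudnn_archive_for`, `require_archive`) are hypothetical and not part of setup_conda_env.sh. Capping the job count at 64 is an assumption, made only to keep the current behavior on large instances.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of setup_conda_env.sh.

# Choose a build parallelism that never exceeds the number of available CPUs.
# The upper bound of 64 mirrors the value currently hardcoded in the script.
pick_max_jobs() {
    local cpus
    cpus=$(nproc)
    if [ "$cpus" -gt 64 ]; then
        echo 64
    else
        echo "$cpus"
    fi
}

# Map the CUDA version to the cuDNN archive name the script expects.
cudnn_archive_for() {
    if [ "$1" = "11.8" ]; then
        echo "cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz"
    else
        echo "cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz"
    fi
}

# Fail early with a clear message when the archive has not been downloaded.
require_archive() {
    local archive="$1"
    if [ ! -f "$archive" ]; then
        echo "Missing $archive: download it before running this script." >&2
        return 1
    fi
}
```

With these in place, the install line could become `MAX_JOBS=$(pick_max_jobs) pip install flash-attn==2.3.3 --no-build-isolation`, and the cuDNN step could start with `require_archive "$(cudnn_archive_for "$SMP_CUDA_VER")" || exit 1` so a missing archive is reported before `tar` runs.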

junpuf, Oct 09 '24 22:10

#453

junpuf, Oct 09 '24 22:10

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Jan 10 '25 02:01

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Mar 11 '25 02:03