Compiling on Slurm cluster: fatal error: cudnn.h: No such file or directory

Open windprak opened this issue 1 year ago • 4 comments

I am trying to compile TE on a Slurm cluster because containers aren't fully supported there (MPI issues). My setup looks like this:


module load cuda/12.4.1
module load cmake/3.23.1 
module load git/2.35.2 
module load gcc/12.1.0
module load cudnn/9.1.0.70-12.x

source $WORK/venvs/megatron/bin/activate
python -m pip install --force-reinstall setuptools==69.5.1
python -m pip install nltk sentencepiece einops mpmath packaging numpy ninja wheel
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install wheel
MAX_JOBS=4 pip install flash-attn==2.4.2 --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

export CXXFLAGS=-isystem\ $CUDNN_ROOT/include
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main  #or stable doesn't matter

All the variables echo correctly. I can build Megatron-LM and Apex in this environment without problems, but not TE.

Error:

conda/envs/megatron/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
          3 | #include <cudnn.h>
            |          ^~~~~~~~~

windprak avatar Jun 12 '24 12:06 windprak

It looks like PyTorch's C++ extensions are configured with CUDNN_HOME or CUDNN_PATH: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209
PyTorch's build is configured with CUDNN_ROOT: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4
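
Since different parts of the toolchain read different variables, it can help to see which of them are set and whether each one really points at a directory containing include/cudnn.h. The following is only a diagnostic sketch: the function name check_cudnn_vars is made up for illustration, and the include/cudnn.h layout is an assumption about how the cuDNN module is installed.

```shell
# Report, for each cuDNN-related variable named in this thread, whether it
# points at a directory that actually contains include/cudnn.h.
# (check_cudnn_vars is a hypothetical helper, not part of PyTorch or TE.)
check_cudnn_vars() {
    for name in CUDNN_HOME CUDNN_PATH CUDNN_ROOT; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "dir=\${$name:-}"
        if [ -z "$dir" ]; then
            echo "$name: unset"
        elif [ -f "$dir/include/cudnn.h" ]; then
            echo "$name: ok ($dir/include/cudnn.h)"
        else
            echo "$name: set, but no include/cudnn.h under $dir"
        fi
    done
}

check_cudnn_vars
```

If only CUDNN_ROOT reports ok, as is likely after the module load in the original post, that would be consistent with the header not being found by the C++ extension build.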

timmoon10 avatar Jun 13 '24 19:06 timmoon10

It looks like PyTorch's C++ extensions are configured with CUDNN_HOME or CUDNN_PATH: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209
PyTorch's build is configured with CUDNN_ROOT: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4

So what can I do to handle this issue? Please give a clear and simple answer, thanks!

ywb2018 avatar Jun 22 '24 09:06 ywb2018

export CUDNN_PATH=/path/to/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
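
If it is not obvious what /path/to/cudnn should be, one possible way to find it is to look for a directory that contains include/cudnn.h. The sketch below tries an existing CUDNN_PATH, then CUDNN_ROOT (set by the cluster module in the original post), then the pip-installed nvidia.cudnn package. The helper name find_cudnn_dir is hypothetical, and the assumption that the header sits under include/ in each candidate may not hold on every system.

```shell
# Print the first candidate directory that contains include/cudnn.h.
# (find_cudnn_dir is a hypothetical helper written for this thread.)
find_cudnn_dir() {
    # Directory of the pip-installed nvidia.cudnn package; empty if absent.
    wheel_dir=$(python -c 'import nvidia.cudnn as m; print(list(m.__path__)[0])' 2>/dev/null || true)
    for dir in "${CUDNN_PATH:-}" "${CUDNN_ROOT:-}" "$wheel_dir"; do
        if [ -n "$dir" ] && [ -f "$dir/include/cudnn.h" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

if dir=$(find_cudnn_dir); then
    export CUDNN_PATH="$dir"
    echo "Using CUDNN_PATH=$CUDNN_PATH"
else
    echo "cudnn.h not found; set CUDNN_PATH by hand" >&2
fi
```

Run this in the same shell before the pip install command so that the exported variable is visible to the build.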

timmoon10 avatar Jun 25 '24 00:06 timmoon10

I believe there is an inconsistency somewhere:

I did get the same error:

      CMake Error at /tmp/pip-req-build-pq0o2oig/3rdparty/cudnn-frontend/cmake/cuDNN.cmake:3 (find_path):
        Could not find CUDNN_INCLUDE_DIR using the following files: cudnn.h
      Call Stack (most recent call first):
        CMakeLists.txt:44 (include)

and I followed the suggestion of setting CUDNN_PATH, which helped CMake find the correct paths:

$ CUDNN_PATH=/home/<some-path>/python3.10/site-packages/nvidia/cudnn pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@stable

...
      -- cudnn found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn.so.9.
      -- Found LIBRARY: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/include
      -- cuDNN: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn.so.9
      -- cuDNN: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/include
      -- cudnn_cnn found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_cnn.so.9.
      -- cudnn_adv found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_adv.so.9.
      -- cudnn_graph found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_graph.so.9.
...

but then the build fails with:

...
      Run Build Command(s): /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/bin/ninja -v
      [1/43] /usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/.. -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/include -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-sbks6idg/build/cmake/string_headers -isystem /usr/local/cuda-12.4/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-sbks6idg/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -MF CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o.d -o CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -c /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp
      FAILED: CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o
      /usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/.. -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/include -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-sbks6idg/build/cmake/string_headers -isystem /usr/local/cuda-12.4/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-sbks6idg/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -MF CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o.d -o CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -c /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp
      In file included from /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp:7:
      /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.h:10:10: fatal error: cudnn.h: No such file or directory
         10 | #include <cudnn.h>
...

In #954, we suggest setting CPLUS_INCLUDE_PATH, which helps, but IMO CMake should handle this transparently.
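
For reference, the CPLUS_INCLUDE_PATH workaround can be sketched roughly as follows. This is not copied from #954; the helper name prepend_include_dir is hypothetical, and it assumes CUDNN_PATH already points at the directory whose include/ subdirectory holds cudnn.h. The helper avoids leaving an empty element in the variable, which GCC would treat as the current working directory.

```shell
# Prepend a directory to CPLUS_INCLUDE_PATH without leaving a trailing colon.
# (prepend_include_dir is a hypothetical helper, not part of TE.)
prepend_include_dir() {
    CPLUS_INCLUDE_PATH="$1${CPLUS_INCLUDE_PATH:+:$CPLUS_INCLUDE_PATH}"
    export CPLUS_INCLUDE_PATH
}

# Only touch the variable when CUDNN_PATH is actually set.
if [ -n "${CUDNN_PATH:-}" ]; then
    prepend_include_dir "$CUDNN_PATH/include"
    echo "CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH"
fi
```

After this, rerunning the pip install command from the same shell lets the compiler see cudnn.h even though no matching -I flag appears in the generated compile command.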

pramodk avatar Mar 23 '25 07:03 pramodk

Is this still an issue? The fix I suggested above was merged in https://github.com/NVIDIA/TransformerEngine/pull/1589.

pramodk avatar Jun 07 '25 19:06 pramodk