Compiling on a Slurm cluster: fatal error: cudnn.h: No such file or directory
I am trying to compile TE on a Slurm cluster because containers aren't fully supported there (MPI issues). My setup looks like this:
module load cuda/12.4.1
module load cmake/3.23.1
module load git/2.35.2
module load gcc/12.1.0
module load cudnn/9.1.0.70-12.x
source $WORK/venvs/megatron/bin/activate
python -m pip install --force-reinstall setuptools==69.5.1
python -m pip install nltk sentencepiece einops mpmath packaging numpy ninja wheel
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install wheel
MAX_JOBS=4 pip install flash-attn==2.4.2 --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
export CXXFLAGS=-isystem\ $CUDNN_ROOT/include
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main #or stable doesn't matter
All the variables echo the expected values. I can build Megatron-LM and Apex in this environment without problems, but not TE.
Error:
conda/envs/megatron/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
3 | #include <cudnn.h>
| ^~~~~~~~~
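Before changing any build variables, it can help to confirm that cudnn.h actually exists under the directory the loaded module points at. The following is a minimal sketch, assuming the cudnn module exports CUDNN_ROOT (as the setup above implies) and that the headers sit in an include/ subdirectory. find_cudnn_header is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: print the first <root>/include/cudnn.h that exists
# among the candidate roots passed as arguments; return 1 if none does.
find_cudnn_header() {
  for root in "$@"; do
    if [ -n "$root" ] && [ -f "$root/include/cudnn.h" ]; then
      printf '%s\n' "$root/include/cudnn.h"
      return 0
    fi
  done
  return 1
}

# The variable names are assumptions; unset ones expand to empty and are skipped.
find_cudnn_header "${CUDNN_ROOT:-}" "${CUDNN_HOME:-}" "${CUDNN_PATH:-}" \
  || echo "cudnn.h not found under CUDNN_ROOT, CUDNN_HOME, or CUDNN_PATH" >&2
```

If the header is found but the build still fails, the problem is more likely that the compiler is never told about that directory than that cuDNN is missing.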
It looks like PyTorch's C++ extensions are configured with CUDNN_HOME or CUDNN_PATH:
https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209
PyTorch's build is configured with CUDNN_ROOT:
https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4
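Taken together, the two links suggest that a module exporting only CUDNN_ROOT satisfies PyTorch's own CMake build but not torch.utils.cpp_extension, which reads CUDNN_HOME or CUDNN_PATH. One possible workaround, sketched here, is to mirror the value into those names before building. It assumes the cudnn module sets CUDNN_ROOT; /path/to/cudnn is only a placeholder fallback, not a real location:

```shell
# Assumption: the cudnn module exports CUDNN_ROOT; otherwise substitute the
# real install prefix for the /path/to/cudnn placeholder.
CUDNN_ROOT="${CUDNN_ROOT:-/path/to/cudnn}"

# Mirror it into the names read by torch/utils/cpp_extension.py.
export CUDNN_HOME="$CUDNN_ROOT"
export CUDNN_PATH="$CUDNN_ROOT"

echo "CUDNN_HOME=$CUDNN_HOME"
echo "CUDNN_PATH=$CUDNN_PATH"
```

With these exported in the same shell, the pip install command for TransformerEngine can be rerun.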
So what can I do to handle this issue? Please give a clear and simple answer, thanks!
export CUDNN_PATH=/path/to/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
I believe there is an inconsistency somewhere:
I got the same error:
CMake Error at /tmp/pip-req-build-pq0o2oig/3rdparty/cudnn-frontend/cmake/cuDNN.cmake:3 (find_path):
Could not find CUDNN_INCLUDE_DIR using the following files: cudnn.h
Call Stack (most recent call first):
CMakeLists.txt:44 (include)
and I followed the suggestion of setting CUDNN_PATH, which helped CMake find the correct paths:
$ CUDNN_PATH=/home/<some-path>/python3.10/site-packages/nvidia/cudnn pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@stable
...
-- cudnn found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn.so.9.
-- Found LIBRARY: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/include
-- cuDNN: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn.so.9
-- cuDNN: /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/include
-- cudnn_cnn found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_cnn.so.9.
-- cudnn_adv found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_adv.so.9.
-- cudnn_graph found at /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn_graph.so.9.
...
but then the build fails with:
...
Run Build Command(s): /home/pramod/workspace/nemo-home/nemo-conda-install/conda_envs/nemo/bin/ninja -v
[1/43] /usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/.. -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/include -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-sbks6idg/build/cmake/string_headers -isystem /usr/local/cuda-12.4/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-sbks6idg/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -MF CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o.d -o CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -c /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp
FAILED: CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o
/usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/.. -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/include -I/tmp/pip-req-build-sbks6idg/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-sbks6idg/build/cmake/string_headers -isystem /usr/local/cuda-12.4/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-sbks6idg/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -MF CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o.d -o CMakeFiles/transformer_engine.dir/cudnn_utils.cpp.o -c /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp
In file included from /tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.cpp:7:
/tmp/pip-req-build-sbks6idg/transformer_engine/common/cudnn_utils.h:10:10: fatal error: cudnn.h: No such file or directory
10 | #include <cudnn.h>
...
In #954, we suggest setting CPLUS_INCLUDE_PATH, which helps, but IMO CMake should handle this transparently.
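The CPLUS_INCLUDE_PATH workaround could look roughly like the sketch below. GCC searches the directories listed in this variable when compiling C++, so the compiler can find cudnn.h even though the generated command line carries no include flag for cuDNN. This assumes CUDNN_PATH points at the cuDNN prefix that CMake reported (the directory containing include/ and lib/); /path/to/cudnn is a placeholder:

```shell
# Assumption: CUDNN_PATH is the cuDNN prefix; replace the placeholder if unset.
CUDNN_PATH="${CUDNN_PATH:-/path/to/cudnn}"

# Prepend the cuDNN headers and keep any existing entries. The ${VAR:+...}
# form avoids a trailing ':' (GCC treats an empty element as the current directory).
export CPLUS_INCLUDE_PATH="$CUDNN_PATH/include${CPLUS_INCLUDE_PATH:+:$CPLUS_INCLUDE_PATH}"

echo "CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH"
```

The export has to happen in the same shell that later runs pip install, since the variable is read by the compiler at build time.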
Is this still an issue? The fix I suggested above was merged in https://github.com/NVIDIA/TransformerEngine/pull/1589.