ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects

Open TJ-Ouyang opened this issue 1 year ago • 6 comments

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
git version 2.34.1
torch.__version__ = 2.2.1+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
from /usr/local/cuda/bin

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
    return _build_backend().build_wheel(wheel_directory, config_settings,
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 416, in build_wheel
    return self._build_with_temp_dir(['bdist_wheel'], '.whl',
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 401, in _build_with_temp_dir
    self.run_setup()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 338, in run_setup
    exec(code, locals())
  File "<string>", line 178, in <module>
  File "<string>", line 40, in check_cuda_torch_binary_vs_bare_metal
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1. In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).
error: subprocess-exited-with-error

× Building wheel for apex (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmpvipwq2mw
cwd: /tmp/pip-req-build-isqlmxnv
Building wheel for apex (pyproject.toml) ... error
ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects


As a result, running the inference code fails with: ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I still could not resolve the problem by following the method in https://github.com/NVIDIA/apex/issues/1653. I tried on both a server and Colab.
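A quick way to confirm whether the compiled extension is present at all is to ask Python's import machinery for it. This is only a minimal sketch: the module name `fused_layer_norm_cuda` comes from the error above, while the helper name `missing_extensions` is made up for illustration.

```python
import importlib.util


def missing_extensions(names):
    """Return the module names that Python cannot locate on sys.path.

    Only top-level module names are expected here (no dotted paths).
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'fused_layer_norm_cuda' is the module named in the ModuleNotFoundError.
    missing = missing_extensions(["fused_layer_norm_cuda"])
    if missing:
        print("Not importable (apex CUDA extensions were probably not built):", missing)
    else:
        print("Extension module found on sys.path.")
```

If the module is reported as missing, the apex wheel was most likely never built with its CUDA extensions, which matches the build failure in the log.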

TJ-Ouyang, Apr 01 '24 14:04

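This is a common error caused by a version mismatch between your system-wide CUDA toolkit (nvcc 12.2) and the CUDA version PyTorch was built with (12.1). You should comment out the version check in check_cuda_torch_binary_vs_bare_metal (the original comment showed it in a screenshot).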
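To see which kind of mismatch you have, the two version strings can be compared directly. The following is a rough sketch, not apex's own code: the function names are hypothetical, and the sample string is copied from the nvcc output in the log above.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text such as 'release 12.2,'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version):
    """Extract (major, minor) from a torch.version.cuda string such as '12.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def describe_mismatch(nvcc_output, torch_cuda):
    """Classify the relationship between the nvcc and PyTorch CUDA versions."""
    nvcc_version = parse_nvcc_release(nvcc_output)
    torch_version = parse_torch_cuda(torch_cuda)
    if nvcc_version == torch_version:
        return "match"
    if nvcc_version[0] != torch_version[0]:
        return "major mismatch"
    return "minor mismatch"


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 12.2, V12.2.140"
    print(describe_mismatch(sample, "12.1"))
```

On a real machine the inputs would come from running `nvcc --version` (for example through `subprocess.run`) and from `torch.version.cuda`; they are passed in as plain strings here so the sketch runs without a GPU or PyTorch installed.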

Edenzzzz, Apr 02 '24 05:04

After commenting it out, I get another error:

[1/1] c++ -MMD -MF /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o.d -pthread -B /data1/ouyangtianjian/.conda/envs/opensora/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /data1/ouyangtianjian/.conda/envs/opensora/include -fPIC -O2 -isystem /data1/ouyangtianjian/.conda/envs/opensora/include -fPIC -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/flatten_unflatten.cpp -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=apex_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
g++ -pthread -B /data1/ouyangtianjian/.conda/envs/opensora/compiler_compat -shared -Wl,-rpath,/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath-link,/data1/ouyangtianjian/.conda/envs/opensora/lib -L/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath,/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath-link,/data1/ouyangtianjian/.conda/envs/opensora/lib -L/data1/ouyangtianjian/.conda/envs/opensora/lib /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -L/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
building 'amp_C' extension
Emitting ninja build file /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/14] /data1/ouyangtianjian/.conda/envs/opensora/bin/nvcc -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
FAILED: /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o
/data1/ouyangtianjian/.conda/envs/opensora/bin/nvcc -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
In file included from /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu:3:
/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:6:10: fatal error: cusparse.h: No such file or directory
    6 | #include <cusparse.h>
      |          ^~~~~~~~~~~~
compilation terminated.
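The nvcc in this log is the one inside the conda environment, so one possible cause (an assumption, not something the log proves) is that the environment does not ship the cuSPARSE development headers. A small sketch to look for `cusparse.h` under likely roots; the choice of the `CUDA_HOME` and `CONDA_PREFIX` environment variables and the helper names are assumptions made for illustration, while `/usr/local/cuda` is the toolkit path from the first log.

```python
import os
from pathlib import Path


def find_header(header, roots):
    """Return every file named `header` found under the given root directories."""
    hits = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            hits.extend(sorted(root.rglob(header)))
    return hits


def candidate_roots():
    """Collect directories that commonly hold CUDA headers (assumed locations)."""
    roots = []
    for var in ("CUDA_HOME", "CONDA_PREFIX"):
        value = os.environ.get(var)
        if value:
            roots.append(value)
    roots.append("/usr/local/cuda")
    return roots


if __name__ == "__main__":
    found = find_header("cusparse.h", candidate_roots())
    if found:
        for path in found:
            print(path)
    else:
        print("cusparse.h not found under:", candidate_roots())
```

If the header only turns up under the system toolkit, pointing the build at that toolkit instead of the conda nvcc may be worth trying.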

TJ-Ouyang, Apr 02 '24 09:04

Could you try reinstalling your system NVIDIA driver so that its version matches?

Edenzzzz, Apr 02 '24 09:04

Unfortunately, the server administrator refuses to change the NVIDIA driver version (currently 12.2) because many people share the GPU. It also seems that a PyTorch build for CUDA 12.2 hasn't been released. Thank you for your help anyway.

TJ-Ouyang, Apr 02 '24 12:04

Finally found the solution: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798

I should not have commented out the whole function. Only the "if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):" block, that is, the condition together with the RuntimeError it raises, needs to be deleted.
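For anyone who would rather not delete the check outright, a softer variant is to keep failing on a major-version mismatch and only warn on a minor one. The sketch below is a stand-alone illustration of that idea, not apex's actual code: the function name `check_cuda_versions` and its tuple arguments are hypothetical, and it is a different approach from the deletion described above.

```python
import warnings


def check_cuda_versions(bare_metal, torch_binary):
    """Compare two (major, minor) CUDA version tuples.

    Raises on a major-version mismatch, warns on a minor-version mismatch,
    and returns True only when both versions are identical.
    """
    if bare_metal[0] != torch_binary[0]:
        raise RuntimeError(
            "CUDA major version mismatch: toolkit {}.{} vs PyTorch {}.{}".format(
                bare_metal[0], bare_metal[1], torch_binary[0], torch_binary[1]
            )
        )
    if bare_metal[1] != torch_binary[1]:
        warnings.warn(
            "CUDA minor version mismatch: toolkit {}.{} vs PyTorch {}.{}; "
            "continuing at your own risk".format(
                bare_metal[0], bare_metal[1], torch_binary[0], torch_binary[1]
            )
        )
        return False
    return True


if __name__ == "__main__":
    # Versions from this issue: nvcc 12.2, PyTorch built with CUDA 12.1.
    print(check_cuda_versions((12, 2), (12, 1)))
```

With the versions from this issue, the call emits a warning and the build would be allowed to continue, which has the same practical effect as removing the "if" block while still catching a 11.x versus 12.x mix-up.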

TJ-Ouyang, Apr 02 '24 16:04

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Apr 10 '24 01:04

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Apr 17 '24 01:04