[BERT/PyTorch] Docker build fails at RUN pip install lamb_amp_opt/
Related to BERT/PyTorch
Describe the bug
When I run bash scripts/docker/build.sh, the last step of the build fails with the error below:
=> ERROR [11/11] RUN pip install lamb_amp_opt/ 1.5s
[11/11] RUN pip install lamb_amp_opt/:
0.490 Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
0.493 Processing ./lamb_amp_opt
0.493 DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
0.493 pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
0.672 ERROR: Command errored out with exit status 1:
0.672 command: /opt/conda/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-e2j59bku/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-e2j59bku/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-iyrisslh
0.672 cwd: /tmp/pip-req-build-e2j59bku/
0.672 Complete output (13 lines):
0.672 Traceback (most recent call last):
0.672   File "<string>", line 1, in <module>
0.672   File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 27, in <module>
0.672     from .dist import Distribution
0.672   File "/opt/conda/lib/python3.8/site-packages/setuptools/dist.py", line 30, in <module>
0.672     from . import (
0.672   File "/opt/conda/lib/python3.8/site-packages/setuptools/_entry_points.py", line 6, in <module>
0.672     from jaraco.text import yield_lines
0.672   File "/opt/conda/lib/python3.8/site-packages/setuptools/_vendor/jaraco/text/__init__.py", line 12, in <module>
0.672     from jaraco.context import ExceptionTrap
0.672   File "/opt/conda/lib/python3.8/site-packages/setuptools/_vendor/jaraco/context.py", line 17, in <module>
0.672     from backports import tarfile
0.672 ImportError: cannot import name 'tarfile' from 'backports' (/opt/conda/lib/python3.8/site-packages/backports/__init__.py)
0.672 ----------------------------------------
0.673 WARNING: Discarding file:///workspace/bert/lamb_amp_opt. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
0.673 ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Dockerfile:35
  33 | RUN python -m nltk.downloader punkt
  34 |
  35 | >>> RUN pip install lamb_amp_opt/
  36 |
ERROR: failed to solve: process "/bin/sh -c pip install lamb_amp_opt/" did not complete successfully: exit code: 1
To Reproduce
Steps to reproduce the behavior:
- Clone the BERT model and install the requirements
- Launch bash scripts/docker/build.sh
Expected behavior
The Docker build completes without errors.
Environment
- Python: 3.6 (the build log shows python3.8 inside the container)
- GPUs in the system: 2x H100 NVL
- CUDA driver version: 12.7
Hi @saliedev051
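I ran into this exact issue as well. It appears to be a dependency conflict: recent versions of setuptools are not fully compatible with the older build scripts for this BERT example. In the traceback, the jaraco.context module vendored inside setuptools fails on "from backports import tarfile" because the backports package already installed in the image does not provide tarfile.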
The following workaround resolved the problem for me. You can try modifying the Dockerfile to pin setuptools to an older version and explicitly install the required dependencies before the failing step.
In the Dockerfile, right before the RUN pip install lamb_amp_opt/ line, add this command:
RUN pip install "setuptools<58" wheel backports.tarfile
This should downgrade setuptools to a compatible version and provide the missing backports.tarfile module, which should resolve the ImportError and let the build complete. Hope this helps!
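If you want to confirm whether the workaround took effect, a small check like the one below can be run inside the container. This is only a sketch: the helper names can_import and installed_version are mine, not part of the repository, and it assumes Python 3.8 or newer (which is what the build log shows) so that importlib.metadata is available.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly in this interpreter."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        # Any failure in the import chain counts, including the
        # ImportError raised deep inside setuptools in the log above.
        return False


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # importlib.metadata is missing on Python < 3.8.
        return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Hypothetical diagnostic: report the two imports involved in the failure.
    for name in ("backports.tarfile", "setuptools"):
        status = "ok" if can_import(name) else "FAILED"
        print(f"import {name}: {status}")
    print("setuptools version:", installed_version("setuptools"))
```

Run it with the same interpreter the build uses (/opt/conda/bin/python3.8 in the log). If "import setuptools" still reports FAILED, pip will most likely hit the same error when it runs setup.py egg_info.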
Hi @saliedev051
In my case the issue seemed to come from a CUDA version conflict.
I replaced ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.11-py3 with ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.05-py3 in the Dockerfile, and the build worked.
Hope this helps