
NCCL error when running the JAX model LWM-Chat-1M-Jax

Open jeffchy opened this issue 1 year ago • 0 comments

Environment

GPUs: 4x80G

Package Version
absl-py 2.1.0
aiohttp 3.9.3
aiosignal 1.3.1
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
cachetools 5.3.2
certifi 2024.2.2
charset-normalizer 3.3.2
chex 0.1.82
click 8.1.7
cloudpickle 3.0.0
contextlib2 21.6.0
datasets 2.13.0
decorator 5.1.1
decord 0.6.0
dill 0.3.6
docker-pycreds 0.4.0
einops 0.7.0
etils 1.7.0
exceptiongroup 1.2.0
executing 2.0.1
filelock 3.13.1
flax 0.7.0
frozenlist 1.4.1
fsspec 2024.2.0
gcsfs 2024.2.0
gitdb 4.0.11
GitPython 3.1.42
google-api-core 2.17.1
google-auth 2.28.1
google-auth-oauthlib 1.2.0
google-cloud-core 2.4.1
google-cloud-storage 2.14.0
google-crc32c 1.5.0
google-resumable-media 2.7.0
googleapis-common-protos 1.62.0
huggingface-hub 0.20.3
idna 3.6
imageio 2.34.0
imageio-ffmpeg 0.4.9
importlib-resources 6.1.1
ipdb 0.13.13
ipython 8.21.0
jax 0.4.23
jaxlib 0.4.23+cuda12.cudnn89
jedi 0.19.1
markdown-it-py 3.0.0
matplotlib-inline 0.1.6
mdurl 0.1.2
ml-collections 0.1.1
ml-dtypes 0.3.2
msgpack 1.0.7
multidict 6.0.5
multiprocess 0.70.14
nest-asyncio 1.6.0
numpy 1.26.4
nvidia-cublas-cu12 12.3.4.1
nvidia-cuda-cupti-cu12 12.3.101
nvidia-cuda-nvcc-cu12 12.3.107
nvidia-cuda-nvrtc-cu12 12.3.107
nvidia-cuda-runtime-cu12 12.3.101
nvidia-cudnn-cu12 8.9.7.29
nvidia-cufft-cu12 11.0.12.1
nvidia-cusolver-cu12 11.5.4.101
nvidia-cusparse-cu12 12.2.0.103
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
oauthlib 3.2.2
opt-einsum 3.3.0
optax 0.1.7
orbax-checkpoint 0.5.3
packaging 23.2
pandas 2.2.0
parso 0.8.3
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 15.0.0
pyasn1 0.5.1
pyasn1-modules 0.3.0
Pygments 2.17.2
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
rich 13.7.0
rsa 4.9
scipy 1.12.0
sentencepiece 0.2.0
sentry-sdk 1.40.5
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
stack-data 0.6.3
tensorstore 0.1.53
tiktoken 0.6.0
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.1
tqdm 4.66.2
traitlets 5.14.1
transformers 4.29.2
tux 0.0.2
typing_extensions 4.9.0
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.3
wcwidth 0.2.13
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
zipp 3.17.0

Error Message

`I0222 09:24:21.054814 140683333334848 xla_bridge.py:660] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0222 09:24:21.056322 140683333334848 xla_bridge.py:660] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2024-02-22 09:24:21.097023: W external/xla/xla/service/gpu/nvptx_compiler.cc:698] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.3.107). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
0%| | 0/1 [00:00<?, ?it/s]
2024-02-22 09:25:32.707642: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.
2024-02-22 09:25:32.707708: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2732] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.; current tracing scope: reduce-scatter-start.5; current profiling annotation: XlaModule:#hlo_module=pjit__forward_generate,program_id=21#.
2024-02-22 09:25:32.807973: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.
2024-02-22 09:25:32.808024: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2732] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.; current tracing scope: reduce-scatter-start.5; current profiling annotation: XlaModule:#hlo_module=pjit__forward_generate,program_id=21#.
2024-02-22 09:25:32.821063: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.
2024-02-22 09:25:32.821116: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2732] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.; current tracing scope: reduce-scatter-start.5; current profiling annotation: XlaModule:#hlo_module=pjit__forward_generate,program_id=21#.
2024-02-22 09:25:32.825532: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.
2024-02-22 09:25:32.825585: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2732] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.; current tracing scope: reduce-scatter-start.5; current profiling annotation: XlaModule:#hlo_module=pjit__forward_generate,program_id=21#.
0%| | 0/1 [00:07<?, ?it/s]
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/anaconda3/envs/lwm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/lwm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/jeff/Git/LLM/src/T2V/LWM/lwm/vision_generation.py", line 258, in <module>
    run(main)
  File "/root/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/data/jeff/Git/LLM/src/T2V/LWM/lwm/vision_generation.py", line 184, in main
    img_enc, img = generate_first_frame(prompts, max_input_length=128)
  File "/data/jeff/Git/LLM/src/T2V/LWM/lwm/vision_generation.py", line 158, in generate_first_frame
    output, sharded_rng = _sharded_forward_generate(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Attribute busid of node nic not found'.; current tracing scope: reduce-scatter-start.5; current profiling annotation: XlaModule:#hlo_module=pjit__forward_generate,program_id=21#.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed`
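
The log above only carries NCCL's last warning and a filtered JAX traceback, so a rerun with more logging may help narrow this down. The following is a minimal sketch, not a fix: it builds an environment with `JAX_TRACEBACK_FILTERING=off` (named in the JAX message above) and `NCCL_DEBUG=INFO` (NCCL's standard logging variable). The helper name `diagnostic_env` is hypothetical, and the original launch command and its arguments are not shown in this report, so they are left out.

```python
import os


def diagnostic_env(base=None):
    """Return a copy of the environment with extra JAX/NCCL logging enabled.

    `diagnostic_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    # Named in the JAX message in the log: keep JAX-internal frames
    # in the traceback instead of filtering them out.
    env["JAX_TRACEBACK_FILTERING"] = "off"
    # NCCL's logging switch; INFO prints initialization and topology details.
    env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    # Print the variables as shell exports so they can be pasted in front of
    # the original launch command (its arguments are not part of this report).
    env = diagnostic_env()
    for key in ("JAX_TRACEBACK_FILTERING", "NCCL_DEBUG"):
        print(f"export {key}={env[key]}")
```

The returned dictionary can also be passed as `env=` to `subprocess.run` when relaunching the failing script from Python.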

jeffchy, Feb 23 '24 07:02