"peer access is not supported between these two devices" when using multiple GPUs
Setting: AWS g5.48xlarge. This code worked fine with a single GPU and failed when trying to use more. I also increased --shm-size to 20G, which didn't help.
Can this problem be fixed, or is g5 not supported at all?
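For reference, peer-to-peer (P2P) support between the GPUs can be checked from inside the container as a quick diagnostic (the Python one-liner assumes PyTorch is installed in the container; device indices 0 and 1 are just examples):
# topology matrix: NV# = NVLink, PIX/PHB/SYS = PCIe-only paths
nvidia-smi topo -m
# direct CUDA P2P query (assumes PyTorch is available; indices are examples)
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
The one-liner prints True only if CUDA reports P2P between devices 0 and 1. A10G GPUs on g5 instances are PCIe-attached without NVLink, which would explain the peer-access warnings in the log below.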
root@8c64641a157c:/TensorRT-LLM/examples/mpt# mpirun -n 4 --allow-run-as-root \
python3 ../run.py --max_output_len 100 \
--engine_dir ./trt_engines/mpt-7b/fp16_tp4 \
--tokenizer_dir mosaicml/mpt-7b
tokenizer_config.json: 100%|██████████| 237/237 [00:00<00:00, 2.31MB/s]
tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 43.6MB/s]
special_tokens_map.json: 100%|██████████| 99.0/99.0 [00:00<00:00, 1.30MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 3568 MiB
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 3568 MiB
[TensorRT-LLM][WARNING] Device 3 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 3 peer access Device 7 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 2 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 3568 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3568 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3719, GPU 3844 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3721, GPU 3854 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3719, GPU 3844 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3721, GPU 3854 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3719, GPU 3844 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3719, GPU 3844 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3721, GPU 3854 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3721, GPU 3854 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3565, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3565, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3565, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3565, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3957, GPU 5234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3957, GPU 5242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 3957, GPU 5234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3957, GPU 5234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3957, GPU 5234 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3957, GPU 5242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3957, GPU 5242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3957, GPU 5242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3565 (MiB)
[TensorRT-LLM][INFO] Allocate 16525557760 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 126080 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 16525557760 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 126080 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 16525557760 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 126080 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 16525557760 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 126080 tokens in paged KV cache.
[TensorRT-LLM] TensorRT-LLM version: 0.8.0.dev2024013000
Traceback (most recent call last):
File "/TensorRT-LLM/examples/mpt/../run.py", line 504, in <module>
main(args)
File "/TensorRT-LLM/examples/mpt/../run.py", line 379, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)
1 0x7fb9c0257f2b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7aff2b) [0x7fb9c0257f2b]
2 0x7fb9c1e08668 tensorrt_llm::runtime::setPeerAccess(tensorrt_llm::runtime::WorldConfig, bool) + 216
3 0x7fb9c1df2bba tensorrt_llm::runtime::GptSession::createCustomAllReduceWorkspace(int, int, int) + 202
4 0x7fb9c1df392d tensorrt_llm::runtime::GptSession::setup(tensorrt_llm::runtime::GptSession::Config const&) + 1117
5 0x7fb9c1df3d71 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 977
6 0x7fbac5ec44d4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6a4d4) [0x7fbac5ec44d4]
7 0x7fbac5e9d5c9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x435c9) [0x7fbac5e9d5c9]
8 0x7fbac5e87120 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d120) [0x7fbac5e87120]
9 0x5596c0faa10e python3(+0x15a10e) [0x5596c0faa10e]
10 0x5596c0fa0a7b _PyObject_MakeTpCall + 603
11 0x5596c0fb8acb python3(+0x168acb) [0x5596c0fb8acb]
12 0x5596c0fb9635 _PyObject_Call + 277
13 0x5596c0fb5087 python3(+0x165087) [0x5596c0fb5087]
14 0x5596c0fa0e2b python3(+0x150e2b) [0x5596c0fa0e2b]
15 0x7fbac5e867d9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2c7d9) [0x7fbac5e867d9]
16 0x5596c0fa0a7b _PyObject_MakeTpCall + 603
17 0x5596c0f9a150 _PyEval_EvalFrameDefault + 30112
18 0x5596c0fb87f1 python3(+0x1687f1) [0x5596c0fb87f1]
19 0x5596c0fb9492 PyObject_Call + 290
20 0x5596c0f955d7 _PyEval_EvalFrameDefault + 10791
21 0x5596c0faa9fc _PyFunction_Vectorcall + 124
22 0x5596c0f9326d _PyEval_EvalFrameDefault + 1725
23 0x5596c0f8f9c6 python3(+0x13f9c6) [0x5596c0f8f9c6]
24 0x5596c1085256 PyEval_EvalCode + 134
25 0x5596c10b0108 python3(+0x260108) [0x5596c10b0108]
26 0x5596c10a99cb python3(+0x2599cb) [0x5596c10a99cb]
27 0x5596c10afe55 python3(+0x25fe55) [0x5596c10afe55]
28 0x5596c10af338 _PyRun_SimpleFileObject + 424
29 0x5596c10aef83 _PyRun_AnyFileObject + 67
30 0x5596c10a1a5e Py_RunMain + 702
31 0x5596c107802d Py_BytesMain + 45
32 0x7fbc4549bd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fbc4549bd90]
33 0x7fbc4549be40 __libc_start_main + 128
34 0x5596c1077f25 _start + 37
Traceback (most recent call last):
File "/TensorRT-LLM/examples/mpt/../run.py", line 504, in <module>
main(args)
File "/TensorRT-LLM/examples/mpt/../run.py", line 379, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)
1 0x7fafb7057f2b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7aff2b) [0x7fafb7057f2b]
2 0x7fafb8c08668 tensorrt_llm::runtime::setPeerAccess(tensorrt_llm::runtime::WorldConfig, bool) + 216
3 0x7fafb8bf2bba tensorrt_llm::runtime::GptSession::createCustomAllReduceWorkspace(int, int, int) + 202
4 0x7fafb8bf392d tensorrt_llm::runtime::GptSession::setup(tensorrt_llm::runtime::GptSession::Config const&) + 1117
5 0x7fafb8bf3d71 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 977
6 0x7fb0bccc44d4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6a4d4) [0x7fb0bccc44d4]
7 0x7fb0bcc9d5c9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x435c9) [0x7fb0bcc9d5c9]
8 0x7fb0bcc87120 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d120) [0x7fb0bcc87120]
9 0x55c9b511a10e python3(+0x15a10e) [0x55c9b511a10e]
10 0x55c9b5110a7b _PyObject_MakeTpCall + 603
11 0x55c9b5128acb python3(+0x168acb) [0x55c9b5128acb]
12 0x55c9b5129635 _PyObject_Call + 277
13 0x55c9b5125087 python3(+0x165087) [0x55c9b5125087]
14 0x55c9b5110e2b python3(+0x150e2b) [0x55c9b5110e2b]
15 0x7fb0bcc867d9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2c7d9) [0x7fb0bcc867d9]
16 0x55c9b5110a7b _PyObject_MakeTpCall + 603
17 0x55c9b510a150 _PyEval_EvalFrameDefault + 30112
18 0x55c9b51287f1 python3(+0x1687f1) [0x55c9b51287f1]
19 0x55c9b5129492 PyObject_Call + 290
20 0x55c9b51055d7 _PyEval_EvalFrameDefault + 10791
21 0x55c9b511a9fc _PyFunction_Vectorcall + 124
22 0x55c9b510326d _PyEval_EvalFrameDefault + 1725
23 0x55c9b50ff9c6 python3(+0x13f9c6) [0x55c9b50ff9c6]
24 0x55c9b51f5256 PyEval_EvalCode + 134
25 0x55c9b5220108 python3(+0x260108) [0x55c9b5220108]
26 0x55c9b52199cb python3(+0x2599cb) [0x55c9b52199cb]
27 0x55c9b521fe55 python3(+0x25fe55) [0x55c9b521fe55]
28 0x55c9b521f338 _PyRun_SimpleFileObject + 424
29 0x55c9b521ef83 _PyRun_AnyFileObject + 67
30 0x55c9b5211a5e Py_RunMain + 702
31 0x55c9b51e802d Py_BytesMain + 45
32 0x7fb23c1a2d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fb23c1a2d90]
33 0x7fb23c1a2e40 __libc_start_main + 128
34 0x55c9b51e7f25 _start + 37
Traceback (most recent call last):
File "/TensorRT-LLM/examples/mpt/../run.py", line 504, in <module>
main(args)
[TensorRT-LLM] TensorRT-LLM version: 0.8.0.dev2024013000
File "/TensorRT-LLM/examples/mpt/../run.py", line 379, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)
1 0x7fecf8057f2b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7aff2b) [0x7fecf8057f2b]
2 0x7fecf9c08668 tensorrt_llm::runtime::setPeerAccess(tensorrt_llm::runtime::WorldConfig, bool) + 216
3 0x7fecf9bf2bba tensorrt_llm::runtime::GptSession::createCustomAllReduceWorkspace(int, int, int) + 202
4 0x7fecf9bf392d tensorrt_llm::runtime::GptSession::setup(tensorrt_llm::runtime::GptSession::Config const&) + 1117
5 0x7fecf9bf3d71 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 977
6 0x7fedfdcc44d4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6a4d4) [0x7fedfdcc44d4]
7 0x7fedfdc9d5c9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x435c9) [0x7fedfdc9d5c9]
8 0x7fedfdc87120 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d120) [0x7fedfdc87120]
9 0x5629a22f910e python3(+0x15a10e) [0x5629a22f910e]
10 0x5629a22efa7b _PyObject_MakeTpCall + 603
11 0x5629a2307acb python3(+0x168acb) [0x5629a2307acb]
12 0x5629a2308635 _PyObject_Call + 277
13 0x5629a2304087 python3(+0x165087) [0x5629a2304087]
14 0x5629a22efe2b python3(+0x150e2b) [0x5629a22efe2b]
15 0x7fedfdc867d9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2c7d9) [0x7fedfdc867d9]
16 0x5629a22efa7b _PyObject_MakeTpCall + 603
17 0x5629a22e9150 _PyEval_EvalFrameDefault + 30112
18 0x5629a23077f1 python3(+0x1687f1) [0x5629a23077f1]
19 0x5629a2308492 PyObject_Call + 290
20 0x5629a22e45d7 _PyEval_EvalFrameDefault + 10791
21 0x5629a22f99fc _PyFunction_Vectorcall + 124
22 0x5629a22e226d _PyEval_EvalFrameDefault + 1725
23 0x5629a22de9c6 python3(+0x13f9c6) [0x5629a22de9c6]
24 0x5629a23d4256 PyEval_EvalCode + 134
25 0x5629a23ff108 python3(+0x260108) [0x5629a23ff108]
26 0x5629a23f89cb python3(+0x2599cb) [0x5629a23f89cb]
27 0x5629a23fee55 python3(+0x25fe55) [0x5629a23fee55]
28 0x5629a23fe338 _PyRun_SimpleFileObject + 424
29 0x5629a23fdf83 _PyRun_AnyFileObject + 67
30 0x5629a23f0a5e Py_RunMain + 702
31 0x5629a23c702d Py_BytesMain + 45
32 0x7fef7d18ad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fef7d18ad90]
33 0x7fef7d18ae40 __libc_start_main + 128
34 0x5629a23c6f25 _start + 37
Traceback (most recent call last):
File "/TensorRT-LLM/examples/mpt/../run.py", line 504, in <module>
main(args)
File "/TensorRT-LLM/examples/mpt/../run.py", line 379, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/ipcUtils.cpp:48)
1 0x7f5db5a57f2b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7aff2b) [0x7f5db5a57f2b]
2 0x7f5db7608668 tensorrt_llm::runtime::setPeerAccess(tensorrt_llm::runtime::WorldConfig, bool) + 216
3 0x7f5db75f2bba tensorrt_llm::runtime::GptSession::createCustomAllReduceWorkspace(int, int, int) + 202
4 0x7f5db75f392d tensorrt_llm::runtime::GptSession::setup(tensorrt_llm::runtime::GptSession::Config const&) + 1117
5 0x7f5db75f3d71 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 977
6 0x7f5ebb6c44d4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6a4d4) [0x7f5ebb6c44d4]
7 0x7f5ebb69d5c9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x435c9) [0x7f5ebb69d5c9]
8 0x7f5ebb687120 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d120) [0x7f5ebb687120]
9 0x562e5538710e python3(+0x15a10e) [0x562e5538710e]
10 0x562e5537da7b _PyObject_MakeTpCall + 603
11 0x562e55395acb python3(+0x168acb) [0x562e55395acb]
12 0x562e55396635 _PyObject_Call + 277
13 0x562e55392087 python3(+0x165087) [0x562e55392087]
14 0x562e5537de2b python3(+0x150e2b) [0x562e5537de2b]
15 0x7f5ebb6867d9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2c7d9) [0x7f5ebb6867d9]
16 0x562e5537da7b _PyObject_MakeTpCall + 603
17 0x562e55377150 _PyEval_EvalFrameDefault + 30112
18 0x562e553957f1 python3(+0x1687f1) [0x562e553957f1]
19 0x562e55396492 PyObject_Call + 290
20 0x562e553725d7 _PyEval_EvalFrameDefault + 10791
21 0x562e553879fc _PyFunction_Vectorcall + 124
22 0x562e5537026d _PyEval_EvalFrameDefault + 1725
23 0x562e5536c9c6 python3(+0x13f9c6) [0x562e5536c9c6]
24 0x562e55462256 PyEval_EvalCode + 134
25 0x562e5548d108 python3(+0x260108) [0x562e5548d108]
26 0x562e554869cb python3(+0x2599cb) [0x562e554869cb]
27 0x562e5548ce55 python3(+0x25fe55) [0x562e5548ce55]
28 0x562e5548c338 _PyRun_SimpleFileObject + 424
29 0x562e5548bf83 _PyRun_AnyFileObject + 67
30 0x562e5547ea5e Py_RunMain + 702
31 0x562e5545502d Py_BytesMain + 45
32 0x7f603abaad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f603abaad90]
33 0x7f603abaae40 __libc_start_main + 128
34 0x562e55454f25 _start + 37
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[34453,1],2]
Exit code: 1
--------------------------------------------------------------------------
@TobyGE Hi bro,
I am also using the G5 series as my main instances (G5.xlarge, G5.12xlarge). Everything is fine.
This is my setup:
git submodule update --init --recursive && git lfs install && git lfs pull && make -C docker release_build CUDA_ARCHS="86-real" && make -C docker release_run
NVIDIA driver version 535.104.12
OS: Ubuntu 20.04 (AMI ami-06a5005dc37cc37a1)
Could you try it again?
We hit a similar problem before. Hardware:
- x86_64
- 4x A10 without NVLink
We changed the option "use_custom_all_reduce": true -> false and the error disappeared.
Give it a try; we have since run into a different problem...
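For illustration, in our setup the option lives in the engine build config JSON under plugin_config; a trimmed sketch (other keys omitted, and the exact layout may differ across versions):
{
  "plugin_config": {
    "use_custom_all_reduce": false
  }
}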
@TobyGE Hi bro,
I am also using the G5 series as my main instances (G5.xlarge, G5.12xlarge). Everything is fine.
This is my setup:
git submodule update --init --recursive && git lfs install && git lfs pull && make -C docker release_build CUDA_ARCHS="86-real" && make -C docker release_run
NVIDIA driver version 535.104.12
OS: Ubuntu 20.04 (AMI ami-06a5005dc37cc37a1)
Could you try it again?
Thanks. I was following the official instructions and installed Ubuntu 22.04.
We hit a similar problem before. Hardware:
- x86_64
- 4x A10 without NVLink
We changed the option "use_custom_all_reduce": true -> false and the error disappeared.
Give it a try; we have since run into a different problem...
I've switched to p4d (A100) and the issue is gone; however, OOM came up shortly after, even with tp=8, which was odd.
I think the problem comes from the latest version of the main branch, as the same code works well with 0.8.0.dev2024012301.
I got the same problem with the v0.8.0 tag. Hardware: 4x RTX 4090, model: Mixtral 8x7B INT4.
Please try setting use_custom_all_reduce to false.
Please try setting use_custom_all_reduce to false.
Could you please tell me where to set this option?
Append "--use_custom_all_reduce disable" to trtllm-build command can fix it. This is working for me on 8x4090
Append "--use_custom_all_reduce disable" to trtllm-build command can fix it. This is working for me on 8x4090
Thank you, it worked for me as well!
Append "--use_custom_all_reduce disable" to trtllm-build command can fix it. This is working for me on 8x4090
Thanks. It works for me with TP=2 on 4090.