Shangyan Zhou comments

Results: 132 comments of Shangyan Zhou

Failed to run on H100 GPU with tensor para=8

@Wenhan-Tan I just encountered the same issue. The reason I ran into this problem was that I had enabled hugepages on the physical machine, and UCX triggered a SIGBUS when...

Failed to run on H100 GPU with tensor para=8

> @sphish Thank you! I saw another similar issue here ([NVIDIA/TensorRT-LLM#674](https://github.com/NVIDIA/TensorRT-LLM/issues/674)) which uses TRT-LLM instead of FT. But in that issue, huge pages need to be enabled. I'll try disabling huge...

test_low_latency failed

What is your network hardware configuration? Could you please run `nvidia-smi topo -mp` and `ibv_devinfo` and share the results?
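A minimal sketch of collecting both outputs in one go, so they can be pasted into the issue together. The helper name `run_diagnostic` is hypothetical; the only commands invoked are the two named above, and a missing tool is reported rather than raising.

```python
import shutil
import subprocess


def run_diagnostic(cmd: list[str], timeout: float = 30.0) -> str:
    """Run a diagnostic command and return its combined output as text.

    Hypothetical helper: returns a short message instead of raising
    when the tool is not installed or does not finish in time.
    """
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found on PATH"
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"{cmd[0]}: timed out after {timeout} seconds"
    # Keep stderr as well, since driver and verbs errors often land there.
    return result.stdout + result.stderr


if __name__ == "__main__":
    for command in (["nvidia-smi", "topo", "-mp"], ["ibv_devinfo"]):
        print("$ " + " ".join(command))
        print(run_diagnostic(command))
```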

test_low_latency failed

> I'm seeing a similar issue: > > ``` > root@22f186c3783d:/workspace# > root@22f186c3783d:/workspace# nvidia-smi topo -mp > GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4...

test_low_latency failed

> [@sphish](https://github.com/sphish) Same issue. Any help?

@liusy58 Can you run NVSHMEM's `shmem_put_bw` test and check whether you encounter the same issue?

test_low_latency failed

> [@sphish](https://github.com/sphish) Hi, output of `shmem_put_bw` is shown below. I cannot resolve this, could you please give me some guidance? > > ``` > /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw > Runtime options after parsing...

test_low_latency failed

> > > [@sphish](https://github.com/sphish) Same issue. Any help? > > > > > > [@liusy58](https://github.com/liusy58) Can you run the NVSHMEM's `shmem_put_bw` test, and will you encounter the same issue? >...

test_low_latency failed

@koanho Can you check if the nvidia-peermem module is correctly installed and loaded?
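One way to check this, as a minimal sketch: on Linux, loaded kernel modules are listed in `/proc/modules`, where this module appears as `nvidia_peermem` (with an underscore). The helper name `is_module_loaded` is hypothetical, and the check only confirms that the module is loaded, not that it is working correctly.

```python
from pathlib import Path


def is_module_loaded(name: str, modules_text: str) -> bool:
    """Return True if a kernel module appears in /proc/modules-style text."""
    # The kernel reports module names with underscores, so normalize hyphens.
    wanted = name.replace("-", "_")
    for line in modules_text.splitlines():
        fields = line.split()
        # The first field of each line is the module name.
        if fields and fields[0] == wanted:
            return True
    return False


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    text = proc_modules.read_text() if proc_modules.exists() else ""
    print("nvidia_peermem loaded:", is_module_loaded("nvidia-peermem", text))
```

If this reports `False`, loading the module (for example with `modprobe nvidia-peermem`) and rerunning the check is a reasonable next step.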

test_low_latency failed

@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver

test_low_latency failed

> Is IBGDA necessary to use DeepEP, right?

@koanho If you want to use low latency mode, yes. If you only want to use the normal mode for training, you...