
server.cc:251] failed to enable peer access for some device pairs

Open Godlovecui opened this issue 1 year ago • 3 comments

System Info

GPUs: 8 x RTX 4090. Versions: TensorRT-LLM v0.9.0, tensorrtllm_backend v0.9.0

Who can help?

@kaiyux @BY

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

None

Expected behavior

None

actual behavior

None

additional notes

When I deploy Llama3-8B on Triton server, it raises the error below: [image]

But it also prints the flag showing that the server launched successfully: [image]

However, when I send requests to the server: [image] [image]

How can I fix it? Thank you~

Godlovecui, Jun 08 '24 06:06

Hi @Godlovecui, I see you're using TensorRT-LLM 0.9.0. Could you try the latest main branch and check whether the issue still exists?

nv-guomingz, Jun 11 '24 06:06

Have you tried

nvidia-smi topo -p2p r

to check whether the drivers for your GPUs are installed and support peer-to-peer access?
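To make the result of that command easier to read on an 8-GPU machine, here is a minimal Python sketch that lists every GPU pair whose status is not OK. It assumes the output is a matrix with a header row of GPU labels, one row per GPU, and status codes such as OK, X (self), or CNS; the helper name `find_blocked_pairs` and the sample text are illustrative only, so compare them against what your driver actually prints.

```python
import shutil
import subprocess


def find_blocked_pairs(matrix_text):
    """Return (row_gpu, col_gpu, status) for every pair whose status is not OK.

    Assumes a header row made only of GPU labels (GPU0, GPU1, ...) followed by
    rows of the form "GPU0 X OK ...". Legend lines and blanks are ignored.
    """
    blocked = []
    columns = None
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if columns is None:
            # The header row is the first line where every field is a GPU label.
            if all(f.startswith("GPU") for f in fields):
                columns = fields
            continue
        # Data rows: a GPU label followed by exactly one status per column.
        if not fields[0].startswith("GPU") or len(fields) != len(columns) + 1:
            continue
        for col, status in zip(columns, fields[1:]):
            if status not in ("X", "OK"):
                blocked.append((fields[0], col, status))
    return blocked


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(
            ["nvidia-smi", "topo", "-p2p", "r"],
            capture_output=True, text=True, check=False,
        )
        pairs = find_blocked_pairs(result.stdout)
        for row, col, status in pairs:
            print(f"{row} -> {col}: {status}")
        if not pairs:
            print("No blocked pairs found (or the output format differs).")
```

If every pair is reported as blocked, peer access is probably unavailable on this hardware or driver setup, which would be consistent with the warning in the issue title.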

Also, I have encountered similar issues where my default GPU installation required me to compile with the use_custom_all_reduce flag disabled.
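As a rough illustration of what disabling that flag could look like when rebuilding the engine, the sketch below only assembles and prints a `trtllm-build` command line. It assumes the v0.9.0 CLI accepts `--use_custom_all_reduce disable` alongside `--checkpoint_dir` and `--output_dir`; the directory paths and the helper name `build_command` are hypothetical, and the option may not exist in other releases, so confirm with `trtllm-build --help` first.

```python
import shlex


def build_command(checkpoint_dir, engine_dir):
    """Assemble a trtllm-build invocation with custom all-reduce turned off.

    The flag spelling is an assumption based on the v0.9.0 CLI; nothing is
    executed here, the command is only constructed.
    """
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--use_custom_all_reduce", "disable",
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace with your converted checkpoint and engine dirs.
    cmd = build_command("./llama3_8b_ckpt", "./llama3_8b_engine")
    print(shlex.join(cmd))
```

Printing the command instead of running it lets you check it against your existing build script before rebuilding.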

TheCodeWrangler, Jun 17 '24 13:06

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot], Jul 18 '24 01:07