server.cc:251] failed to enable peer access for some device pairs
System Info
GPU: 8× RTX 4090; TensorRT-LLM: v0.9.0; tensorrtllm_backend: v0.9.0
Who can help?
@kaiyux @BY
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
None
Expected behavior
None
actual behavior
None
additional notes
When I deploy Llama3-8B on the Triton server, it raises the error below:
However, it also prints the flag indicating the server launched successfully:
Then, when I send requests to the server:
How can I fix this? Thank you~
Hi @Godlovecui, I see you're using TensorRT-LLM v0.9.0. Is it possible to try the latest main branch and see whether the issue still exists?
Have you tried `nvidia-smi topo -p2p r` to check whether your GPU drivers are installed and support peer-to-peer access?
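If you want to turn that matrix into a quick list of the pairs that are failing, a minimal sketch is below. The sample output is an assumption of the typical `nvidia-smi topo -p2p r` format (`X` = self, `OK` = supported, `NS`/`CNS` = not supported); adjust the parsing to your driver's exact output.

```python
# Sketch: parse the matrix printed by `nvidia-smi topo -p2p r` and report
# GPU pairs that lack peer-to-peer read access.
# SAMPLE_TOPO is a hypothetical 3-GPU output for illustration only.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2
 GPU0   X       OK      NS
 GPU1   OK      X       NS
 GPU2   NS      NS      X
"""

def missing_p2p_pairs(topo_text):
    """Return (src, dst) GPU pairs whose P2P status is not OK."""
    lines = [line.split() for line in topo_text.strip().splitlines()]
    header = lines[0]  # column GPU names
    missing = []
    for row in lines[1:]:
        src, statuses = row[0], row[1:]
        for dst, status in zip(header, statuses):
            # Skip the diagonal (a GPU vs. itself); flag anything not "OK".
            if src != dst and status != "OK":
                missing.append((src, dst))
    return missing

if __name__ == "__main__":
    print(missing_p2p_pairs(SAMPLE_TOPO))
```

In the sample above, GPU2 cannot reach (or be reached by) the other two GPUs, which would match the "failed to enable peer access for some device pairs" warning.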
I have also encountered similar issues where my default GPU installation required me to rebuild with the `use_custom_all_reduce` flag disabled.
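For reference, a sketch of what that rebuild could look like with the v0.9.x `trtllm-build` CLI; the checkpoint and output paths are placeholders, and the flag name may differ in other releases, so check `trtllm-build --help` first.

```shell
# Rebuild the TensorRT-LLM engine with the custom all-reduce kernel disabled.
# Paths below are placeholders for your own checkpoint/engine directories.
trtllm-build \
    --checkpoint_dir ./llama3-8b-ckpt \
    --output_dir ./llama3-8b-engine \
    --use_custom_all_reduce disable
```

With the custom all-reduce disabled, multi-GPU reductions fall back to NCCL, which does not require direct peer access between every device pair.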
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.