feature(wgt): enable DI using torch-rpc to support GPU-p2p and RDMA-rpc
Commit:
- Add torchrpc message queue.
- Implement a buffer based on CUDA shared tensors to optimize the torchrpc data path.
- Add a `bypass_eventloop` argument to `Task()` and `Parallel()`.
- Add a thread lock in `distributer.py` to prevent sender/receiver competition.
- Add message queue perf tests for torchrpc, nccl, nng, and shm.
- Add `comm_perf_helper.py` to make program timing more convenient.
- Modify `subscribe()` of class `MQ`, adding an `fn` parameter and an `is_once` parameter (see the sketch after this list).
- Add new `DummyLock` and `ConditionLock` types in `lock_helper.py`.
- Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
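As a rough illustration of the `subscribe()` change, the new parameters could look like the skeleton below (a hypothetical sketch; the actual signature and behavior of `MQ.subscribe()` in DI-engine may differ):

```python
from typing import Callable, Optional


class MQ:
    """Illustrative message-queue interface skeleton (not the actual DI-engine class)."""

    def subscribe(self, topic: str, fn: Optional[Callable] = None, is_once: bool = False) -> None:
        # fn: optional callback invoked when a message on `topic` arrives.
        # is_once: if True, the subscription is dropped after the first received message.
        raise NotImplementedError
```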
Description
DI-engine integrates the torch.distributed.rpc module.
- CPU-P2P-RDMA: supports RDMA CPU p2p transmission in an InfiniBand (IB) network environment.
- GPU-P2P-RDMA: supports GPU p2p communication.
cli-ditask introduces new command line arguments:
- `--mq_type`: introduces the `torchrpc:cuda` and `torchrpc:cpu` options.
  - `torchrpc:cuda`: use torchrpc for communication and allow setting `device_map`; GPU direct RDMA can be used.
  - `torchrpc:cpu`: use torchrpc for communication, but `device_map` is not allowed to be set. All data on the GPU side is copied to the CPU side for transmission.
- `--init-method`: initialization entry for `init_rpc` (required if `--mq_type` is torchrpc).
- `--local-cuda-devices`: set the rank range of local GPUs that can be used (optional; default is all visible devices).
- `--cuda-device-map`: used to set `device_map`. Format: `<Peer node id>_<Local GPU rank>_<Peer GPU rank>,[...]`, e.g. `--cuda-device-map=1_0_1,2_0_2,0_0_0`, corresponding to the PyTorch API `options.set_device_map("worker1", {1: 2})`. (Optional; the default is to map all visible GPUs to GPU-0 of the peer.) A parsing sketch is shown after this list.
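To make the `--cuda-device-map` format concrete, the string could be translated into `set_device_map()` calls roughly as follows (a minimal sketch; the helper name `parse_cuda_device_map`, the `Node_<id>` worker naming, and the init address are assumptions, not the actual DI-engine implementation):

```python
from torch.distributed import rpc


def parse_cuda_device_map(arg: str) -> dict:
    # "1_0_1,2_0_2,0_0_0" -> {"Node_1": {0: 1}, "Node_2": {0: 2}, "Node_0": {0: 0}}
    device_maps = {}
    for item in arg.split(","):
        peer_node, local_rank, peer_rank = item.split("_")
        device_maps.setdefault(f"Node_{peer_node}", {})[int(local_rank)] = int(peer_rank)
    return device_maps


options = rpc.TensorPipeRpcBackendOptions(init_method="tcp://127.0.0.1:12399")
for peer, dmap in parse_cuda_device_map("1_0_1,2_0_2,0_0_0").items():
    # Allow direct GPU-to-GPU tensor transfers between the given local/peer device pairs.
    options.set_device_map(peer, dmap)
```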
Dynamic GPU communication groups
We create device mappings between all possible devices in advance. This mapping is all-to-all, which covers all communication situations; the purpose is to avoid errors caused by incomplete device_map coverage, and setting redundant mappings has no side effects. The mappings are only used to check the validity of devices during transport: a channel is created from these maps only after a new process joins the communication group.
Node_0 device_maps:
("Node_1", {0: 0}), ("Node_2", {0: 0}), ...., ("Node_99", {0: 0})
Node_1 device_maps:
("Node_0", {0: 0}), ("Node_2", {0: 0}), ...., ("Node_99", {0: 0})
At the same time, we still expose the `--cuda-device-map` interface, which allows users to configure the topology between devices; torchrpc will follow the user's input.
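The default all-to-all mapping described above could be built along these lines (a minimal sketch assuming the `Node_<id>` worker naming and a hypothetical `build_default_device_maps` helper; the real DI-engine code may differ):

```python
import torch
from torch.distributed import rpc


def build_default_device_maps(self_id: int, world_size: int) -> dict:
    # Map every visible local GPU to GPU-0 of every other node, e.g. for Node_0:
    # {"Node_1": {0: 0, 1: 0, ...}, "Node_2": {0: 0, 1: 0, ...}, ...}
    local_devices = range(torch.cuda.device_count())
    return {
        f"Node_{peer}": {rank: 0 for rank in local_devices}
        for peer in range(world_size)
        if peer != self_id
    }


options = rpc.TensorPipeRpcBackendOptions()
for peer, dmap in build_default_device_maps(self_id=0, world_size=100).items():
    options.set_device_map(peer, dmap)
```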
Related Issue
TODO
Load balancing capability: in a time-heterogeneous RL task environment, allow each worker to run at full capacity.
Check List
- [x] merge the latest version source branch/repo, and resolve all the conflicts
- [x] pass style check
- [ ] pass all the tests
Codecov Report
Merging #562 (3fa5319) into main (f798002) will decrease coverage by 1.20%. The diff coverage is 37.75%.

:exclamation: Current head 3fa5319 differs from pull request most recent head e32055b. Consider uploading reports for the commit e32055b to get more accurate results.
| Coverage Diff | main | #562 | +/- |
|---|---|---|---|
| Coverage | 83.60% | 82.41% | -1.20% |
| Files | 565 | 571 | +6 |
| Lines | 46375 | 47198 | +823 |
| Hits | 38774 | 38900 | +126 |
| Misses | 7601 | 8298 | +697 |
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 82.41% <37.75%> (-1.20%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ding/torch_utils/data_helper.py | 76.82% <0.00%> (-1.47%) | :arrow_down: |
| ding/data/tests/test_shm_buffer.py | 20.58% <16.07%> (-29.42%) | :arrow_down: |
| ding/framework/message_queue/perfs/perf_nng.py | 16.83% <16.83%> (ø) | |
| ...ramework/message_queue/perfs/perf_torchrpc_nccl.py | 19.49% <19.49%> (ø) | |
| ding/framework/message_queue/perfs/perf_shm.py | 25.84% <25.84%> (ø) | |
| ding/framework/parallel.py | 66.21% <33.91%> (-19.22%) | :arrow_down: |
| ding/data/shm_buffer.py | 60.19% <34.42%> (-37.59%) | :arrow_down: |
| ding/utils/comm_perf_helper.py | 35.82% <35.82%> (ø) | |
| ding/envs/env_manager/subprocess_env_manager.py | 74.63% <38.46%> (-0.26%) | :arrow_down: |
| ding/utils/lock_helper.py | 81.08% <41.17%> (-11.91%) | :arrow_down: |
| ... and 118 more | | |