feature(wgt): enable DI using torch-rpc to support GPU-p2p and RDMA-rpc
Commit:
- Add torchrpc message queue.
- Implement a buffer based on CUDA shared tensors to optimize the torchrpc data path.
- Add a `bypass_eventloop` argument to `Task()` and `Parallel()`.
- Add a thread lock in `distributer.py` to prevent sender/receiver competition.
- Add message queue perf tests for torchrpc, nccl, nng, and shm.
- Add `comm_perf_helper.py` to make program timing more convenient.
- Modify `subscribe()` of class `MQ`, adding an `fn` parameter and an `is_once` parameter (see the sketch after this list).
- Add new `DummyLock` and `ConditionLock` types in `lock_helper.py`.
- Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
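As a rough illustration of the `subscribe()` change, the new parameters could look like the skeleton below (a hypothetical sketch; the actual signature and behavior of `MQ.subscribe()` in DI-engine may differ):

```python
from typing import Callable, Optional


class MQ:
    """Illustrative message-queue interface skeleton (not the actual DI-engine class)."""

    def subscribe(self, topic: str, fn: Optional[Callable] = None, is_once: bool = False) -> None:
        # fn: optional callback invoked when a message on `topic` arrives.
        # is_once: if True, the subscription is dropped after the first received message.
        raise NotImplementedError
```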
Description
DI-engine integrates the torch.distributed.rpc module.
- CPU-P2P-RDMA: supports RDMA CPU p2p transmission in an InfiniBand (IB) network environment.
- GPU-P2P-RDMA: supports GPU p2p communication.
cli-ditask introduces new command line arguments:
- `--mq_type`: introduces the `torchrpc:cuda` and `torchrpc:cpu` options.
  - `torchrpc:cuda`: use torchrpc for communication and allow setting `device_map`; GPU direct RDMA can be used.
  - `torchrpc:cpu`: use torchrpc for communication, but `device_map` is not allowed to be set. All data on the GPU side is copied to the CPU side for transmission.
- `--init-method`: initialization entry for `init_rpc` (required if `--mq_type` is torchrpc).
- `--local-cuda-devices`: set the rank range of local GPUs that can be used (optional; default is all visible devices).
- `--cuda-device-map`: used to set `device_map`. Format: `<Peer node id>_<Local GPU rank>_<Peer GPU rank>,[...]`, e.g. `--cuda-device-map=1_0_1,2_0_2,0_0_0`, corresponding to the PyTorch API `options.set_device_map("worker1", {1: 2})`. (Optional; the default is to map all visible GPUs to GPU-0 of the peer.) A parsing sketch is shown after this list.
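To make the `--cuda-device-map` format concrete, the string could be translated into `set_device_map()` calls roughly as follows (a minimal sketch; the helper name `parse_cuda_device_map`, the `Node_<id>` worker naming, and the init address are assumptions, not the actual DI-engine implementation):

```python
from torch.distributed import rpc


def parse_cuda_device_map(arg: str) -> dict:
    # "1_0_1,2_0_2,0_0_0" -> {"Node_1": {0: 1}, "Node_2": {0: 2}, "Node_0": {0: 0}}
    device_maps = {}
    for item in arg.split(","):
        peer_node, local_rank, peer_rank = item.split("_")
        device_maps.setdefault(f"Node_{peer_node}", {})[int(local_rank)] = int(peer_rank)
    return device_maps


options = rpc.TensorPipeRpcBackendOptions(init_method="tcp://127.0.0.1:12399")
for peer, dmap in parse_cuda_device_map("1_0_1,2_0_2,0_0_0").items():
    # Allow direct GPU-to-GPU tensor transfers between the given local/peer device pairs.
    options.set_device_map(peer, dmap)
```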
Dynamic GPU communication groups
We create device mappings between all possible devices in advance. This mapping is all-to-all, which covers all communication situations; the purpose is to avoid errors caused by incomplete device_map coverage, and setting redundant mappings has no side effects. The mappings are only used to check the validity of devices during transport: a channel is created from these maps only after a new process joins the communication group.
Node_0 device_maps:
("Node_1", {0: 0}), ("Node_2", {0: 0}), ...., ("Node_99", {0: 0})
Node_1 device_maps:
("Node_0", {0: 0}), ("Node_2", {0: 0}), ...., ("Node_99", {0: 0})
At the same time, we still expose the `--cuda-device-map` interface, which allows users to configure the topology between devices; torchrpc will follow the user's input.
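The default all-to-all mapping described above could be built along these lines (a minimal sketch assuming the `Node_<id>` worker naming and a hypothetical `build_default_device_maps` helper; the real DI-engine code may differ):

```python
import torch
from torch.distributed import rpc


def build_default_device_maps(self_id: int, world_size: int) -> dict:
    # Map every visible local GPU to GPU-0 of every other node, e.g. for Node_0:
    # {"Node_1": {0: 0, 1: 0, ...}, "Node_2": {0: 0, 1: 0, ...}, ...}
    local_devices = range(torch.cuda.device_count())
    return {
        f"Node_{peer}": {rank: 0 for rank in local_devices}
        for peer in range(world_size)
        if peer != self_id
    }


options = rpc.TensorPipeRpcBackendOptions()
for peer, dmap in build_default_device_maps(self_id=0, world_size=100).items():
    options.set_device_map(peer, dmap)
```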
Related Issue
TODO
Load balancing capability: in a time-heterogeneous RL task environment, allow each worker to run at full capacity.
Check List
- [x] merge the latest version source branch/repo, and resolve all the conflicts
- [x] pass style check
- [ ] pass all the tests
Codecov Report
Merging #562 (3fa5319) into main (f798002) will decrease coverage by 1.20%. The diff coverage is 37.75%.

:exclamation: Current head 3fa5319 differs from pull request most recent head e32055b. Consider uploading reports for the commit e32055b to get more accurate results.
| Coverage Diff | main | #562 | +/- |
|---|---|---|---|
| Coverage | 83.60% | 82.41% | -1.20% |
| Files | 565 | 571 | +6 |
| Lines | 46375 | 47198 | +823 |
| Hits | 38774 | 38900 | +126 |
| Misses | 7601 | 8298 | +697 |
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 82.41% <37.75%> (-1.20%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ding/torch_utils/data_helper.py | 76.82% <0.00%> (-1.47%) | :arrow_down: |
| ding/data/tests/test_shm_buffer.py | 20.58% <16.07%> (-29.42%) | :arrow_down: |
| ding/framework/message_queue/perfs/perf_nng.py | 16.83% <16.83%> (ø) | |
| ...ramework/message_queue/perfs/perf_torchrpc_nccl.py | 19.49% <19.49%> (ø) | |
| ding/framework/message_queue/perfs/perf_shm.py | 25.84% <25.84%> (ø) | |
| ding/framework/parallel.py | 66.21% <33.91%> (-19.22%) | :arrow_down: |
| ding/data/shm_buffer.py | 60.19% <34.42%> (-37.59%) | :arrow_down: |
| ding/utils/comm_perf_helper.py | 35.82% <35.82%> (ø) | |
| ding/envs/env_manager/subprocess_env_manager.py | 74.63% <38.46%> (-0.26%) | :arrow_down: |
| ding/utils/lock_helper.py | 81.08% <41.17%> (-11.91%) | :arrow_down: |
| ... and 118 more | | |