[WIP] Add APIs to export flow information
Define APIs that can be implemented by a dynamic plugin to export flow info.
Co-authored with Tom Emmons <tom@[redacted]>.
Flow information is currently exported for ring and tree topologies, covering IPv4 and IPv6 over both TCP and RoCEv2. Support for the P2P topology will be added later.
The idea is that the flow information can be used by network devices to do better load-balancing.
Such a plugin will be used if the env var NCCL_FLOW_EXPORT is set to a non-zero value. An example plugin that simply writes the flow info to stdout can be seen here: https://github.com/sreeram-arista/ai-flow-lb
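To make the gating described above concrete, here is a minimal sketch of how an exporter might check `NCCL_FLOW_EXPORT` and write one flow record to stdout. It is not the API defined in this PR: the `FlowInfo` struct, its fields, and the functions `flowExportEnabled`, `formatFlow`, and `exportFlow` are all hypothetical names chosen for illustration, and the output format is an assumption rather than what the example plugin prints.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical flow record. The structure used by the actual patch may differ.
struct FlowInfo {
  std::string srcAddr;  // textual IPv4 or IPv6 address
  std::string dstAddr;  // textual IPv4 or IPv6 address
  uint16_t srcPort;
  uint16_t dstPort;
  bool isRoce;          // true for RoCEv2, false for TCP
};

// Returns true only when NCCL_FLOW_EXPORT parses as a non-zero integer.
// Treating unset or non-numeric values as "disabled" is an assumption.
bool flowExportEnabled() {
  const char* value = std::getenv("NCCL_FLOW_EXPORT");
  if (value == nullptr) return false;
  char* end = nullptr;
  long parsed = std::strtol(value, &end, 10);
  return end != value && parsed != 0;
}

// Formats a flow as "<proto> [src]:port -> [dst]:port".
// Brackets keep IPv6 addresses unambiguous next to the port number.
std::string formatFlow(const FlowInfo& flow) {
  assert(!flow.srcAddr.empty() && !flow.dstAddr.empty());
  std::ostringstream out;
  out << (flow.isRoce ? "RoCEv2" : "TCP") << " "
      << "[" << flow.srcAddr << "]:" << flow.srcPort << " -> "
      << "[" << flow.dstAddr << "]:" << flow.dstPort;
  return out.str();
}

// Writes the flow to stdout if export is enabled; otherwise does nothing.
void exportFlow(const FlowInfo& flow) {
  if (!flowExportEnabled()) return;
  std::cout << formatFlow(flow) << std::endl;
}
```

With `NCCL_FLOW_EXPORT=1` in the environment, calling `exportFlow` on a populated `FlowInfo` prints one line per flow; with the variable unset or set to `0`, the call is a no-op.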
Currently still a WIP because:
- Need to flesh out unit-tests.
- Want to get feedback on the approach before going too deep.
Forgot to add: This patch contains significant contributions (most of it, in fact) from Tom Emmons.
Are you planning to add RoCE support in the same PR?
Yes, I just pushed changes to support RoCEv2 and IPv6 as well (credit: Tom Emmons).
Are you planning to add any special test for this PR? It would be good if you can add some guidelines on how to test this PR on a system that has appropriate hardware support.
I don't have a multi-GPU system, so I wasn't able to unit-test this. I have asked for access to a multi-GPU cluster (AAC); once I get it, I'll be able to write some meaningful tests.
For now, I have added a stub test to show what the scaffolding might look like. It passes, since it doesn't actually do much:
$ UT_VERBOSE=1 ./build/release/test/rccl-UnitTests --gtest_filter=FlowExport.CommExport
================================================================================
Environment variables:
- UT_SHOW_NAMES Show test case names ( 1) <unset>
- UT_MIN_GPUS Minimum number of GPUs to use ( 2) <unset>
- UT_MAX_GPUS Maximum number of GPUs to use ( 0) <unset>
- UT_POW2_GPUS Only allow power-of-2 # of GPUs ( 0) <unset>
- UT_PROCESS_MASK Whether to run single/multi process ( 3) <unset>
- UT_VERBOSE Show verbose unit test output ( 1) 1
- UT_REDOPS List of reduction ops to test ( -1) <unset>
- UT_DATATYPES List of datatypes to test ( -1) <unset>
- UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU ( 1) <unset>
- UT_PRINT_VALUES Print array values (-1 for all) ( 0) <unset>
- UT_SHOW_TIMING Show timing table ( 1) <unset>
- UT_INTERACTIVE Run in interactive mode ( 0) <unset>
================================================================================
Note: Google Test filter = FlowExport.CommExport
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FlowExport
[ RUN ] FlowExport.CommExport
[ INFO ] Detected 0 GPUs
[ OK ] FlowExport.CommExport (6 ms)
[----------] 1 test from FlowExport (7 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (7 ms total)
[ PASSED ] 1 test.
[ INFO ] Total executed cases: 0
[ TIMING ] TEST SUITE : TEST NAME : TIME ms (STATUS)
[ TIMING ] FlowExport : CommExport : 0.01 sec (PASS)
[ TIMING ] FlowExport : TOTAL : 0.01 sec (PASS)
[ TIMING ] Total time: 0.00 minutes
$ UT_MAX_GPUS=1 UT_VERBOSE=1 ./build/release/test/rccl-UnitTests --gtest_filter=FlowExport.CommExport
================================================================================
Environment variables:
- UT_SHOW_NAMES Show test case names ( 1) <unset>
- UT_MIN_GPUS Minimum number of GPUs to use ( 2) <unset>
- UT_MAX_GPUS Maximum number of GPUs to use ( 1) 1
- UT_POW2_GPUS Only allow power-of-2 # of GPUs ( 0) <unset>
- UT_PROCESS_MASK Whether to run single/multi process ( 3) <unset>
- UT_VERBOSE Show verbose unit test output ( 1) 1
- UT_REDOPS List of reduction ops to test ( -1) <unset>
- UT_DATATYPES List of datatypes to test ( -1) <unset>
- UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU ( 1) <unset>
- UT_PRINT_VALUES Print array values (-1 for all) ( 0) <unset>
- UT_SHOW_TIMING Show timing table ( 1) <unset>
- UT_INTERACTIVE Run in interactive mode ( 0) <unset>
================================================================================
Note: Google Test filter = FlowExport.CommExport
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FlowExport
[ RUN ] FlowExport.CommExport
[ INFO ] Detected 1 GPUs
[ OK ] FlowExport.CommExport (7 ms)
[----------] 1 test from FlowExport (7 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (7 ms total)
[ PASSED ] 1 test.
[ INFO ] Total executed cases: 0
[ TIMING ] TEST SUITE : TEST NAME : TIME ms (STATUS)
[ TIMING ] FlowExport : CommExport : 0.01 sec (PASS)
[ TIMING ] FlowExport : TOTAL : 0.01 sec (PASS)
[ TIMING ] Total time: 0.00 minutes
I do have access to a couple of machines with 1 GPU each (as shown above). Is there a way to get the unit-tests to use more than one machine (e.g., using a hostfile a la MPI)?
I don't think so. @wenkaidu - any comments?
I see many test failures now that you have added the new commits.
I don't have access to the test logs. Could you retrieve a couple and send them to me?
I see the following in the log: `[ RUN ] FlowExport.CommExport
06106a6eaa4c:51110:51110 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
06106a6eaa4c:51110:51110 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 0 : librccl-net.so: cannot open shared object file: No such file or directory
06106a6eaa4c:51110:51110 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
06106a6eaa4c:51110:51110 [0] NCCL INFO flow export disabled
06106a6eaa4c:51110:51110 [0] NCCL INFO Kernel version: 5.9.1-amdsos-build32-1+
RCCL version 2.18.3+hip5.7 HEAD:8eabd2a
[ ERROR ] Child 0 pipe closed unexpectedly`
Thanks. I've pushed a commit that I think will fix it. Let's see what the CI shows.
@sreeram-arista Is this something you are still working on? It is still tagged as WIP and there has been no movement in a while.
I've discarded this PR. I haven't worked on it since it was posted, and I don't think I'll be able to follow up on it in the foreseeable future.