[WIP] Add APIs to export flow information

Open sreeram-arista opened this issue 2 years ago • 10 comments

Define APIs that can be implemented by a dynamic plugin to export flow info.

Co-authored with Tom Emmons <tom@[redacted]>.

The patch currently exports flow information for ring and tree topologies, covering both IPv4 and IPv6 as well as TCP and RoCEv2. We'll add support for P2P topology later.

The idea is that the flow information can be used by network devices to do better load-balancing.
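To make "flow information" concrete, here is a minimal sketch of what a per-flow record and the ring case might look like. It is not the PR's API: `FlowInfo`, `Transport`, and `ringFlows` are hypothetical names, and the sketch assumes one rank per host so that every ring edge crosses the network. The only external fact it relies on is that RoCEv2 is carried over UDP with destination port 4791.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flow record; field names are illustrative, not taken from the PR.
enum class Transport { Tcp, RoCEv2 };

struct FlowInfo {
  std::string srcAddr;    // IPv4 or IPv6 address in text form
  std::string dstAddr;
  std::uint16_t srcPort;
  std::uint16_t dstPort;
  Transport transport;
};

// RoCEv2 packets are UDP datagrams sent to destination port 4791.
constexpr std::uint16_t kRoCEv2UdpPort = 4791;

// Assumption: one rank per host. In a ring of n ranks, rank i sends to
// rank (i + 1) % n, so the ring yields n directed flows.
std::vector<FlowInfo> ringFlows(const std::vector<std::string>& rankAddrs,
                                Transport transport,
                                std::uint16_t srcPort,
                                std::uint16_t tcpDstPort) {
  std::vector<FlowInfo> flows;
  const std::size_t n = rankAddrs.size();
  if (n < 2) return flows;  // a single rank has no network peer
  const std::uint16_t dstPort =
      (transport == Transport::RoCEv2) ? kRoCEv2UdpPort : tcpDstPort;
  for (std::size_t i = 0; i < n; ++i) {
    flows.push_back({rankAddrs[i], rankAddrs[(i + 1) % n],
                     srcPort, dstPort, transport});
  }
  return flows;
}
```

A network device that knows these tuples ahead of time could place the flows on distinct paths instead of relying on a hash of the packet headers, which is the load-balancing use case described above.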

Such a plugin will be used if the env var NCCL_FLOW_EXPORT is set to a non-zero value. An example plugin that simply writes the flow info to stdout can be seen here: https://github.com/sreeram-arista/ai-flow-lb
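For illustration, the "non-zero value" check could be sketched roughly as follows. This is an assumption about the behaviour, not the parsing code in the patch: `flowExportEnabledFromValue` and `flowExportEnabled` are hypothetical helper names, and only the variable name `NCCL_FLOW_EXPORT` comes from the description above.

```cpp
#include <cstdlib>

// Hypothetical helper: returns true when the text parses as a non-zero
// decimal integer. A null pointer (variable unset), an empty string, or
// non-numeric text all count as "disabled".
bool flowExportEnabledFromValue(const char* value) {
  if (value == nullptr || *value == '\0') return false;
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (*end != '\0') return false;  // reject trailing garbage such as "1abc"
  return parsed != 0;
}

// Reads the environment variable named in the PR description.
bool flowExportEnabled() {
  return flowExportEnabledFromValue(std::getenv("NCCL_FLOW_EXPORT"));
}
```

Splitting the parsing from the `getenv` call keeps the check testable without touching the process environment. Under this sketch, running with `NCCL_FLOW_EXPORT=1` would enable the plugin, while leaving the variable unset or setting it to `0` would not.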

Currently still a WIP because:

  1. Need to flesh out unit-tests.
  2. Want to get feedback on the approach before going too deep.

sreeram-arista • Aug 09 '23 22:08

Forgot to add: This patch contains significant contributions (most of it, in fact) from Tom Emmons.

sreeram-arista • Aug 09 '23 23:08

Are you planning to add RoCE support in the same PR?

nusislam • Aug 16 '23 21:08

Yes, I just pushed changes to support RoCEv2 and IPv6 as well (credit: Tom Emmons).

sreeram-arista • Aug 18 '23 18:08

Are you planning to add any special test for this PR? It would be good if you can add some guidelines on how to test this PR on a system that has appropriate hardware support.

nusislam • Aug 22 '23 14:08

I don't have a multi-GPU system, so I wasn't able to unit-test this. I have asked for access to a multi-GPU cluster (AAC); once I get it, I'll be able to write some meaningful tests.

For now, I have added a stub test to show what the scaffolding might look like. It passes, since it doesn't actually do much:

$ UT_VERBOSE=1 ./build/release/test/rccl-UnitTests --gtest_filter=FlowExport.CommExport
================================================================================
 Environment variables:
 - UT_SHOW_NAMES        Show test case names                       (  1) <unset>
 - UT_MIN_GPUS          Minimum number of GPUs to use              (  2) <unset>
 - UT_MAX_GPUS          Maximum number of GPUs to use              (  0) <unset>
 - UT_POW2_GPUS         Only allow power-of-2 # of GPUs            (  0) <unset>
 - UT_PROCESS_MASK      Whether to run single/multi process        (  3) <unset>
 - UT_VERBOSE           Show verbose unit test output              (  1) 1
 - UT_REDOPS            List of reduction ops to test              ( -1) <unset>
 - UT_DATATYPES         List of datatypes to test                  ( -1) <unset>
 - UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU (  1) <unset>
 - UT_PRINT_VALUES      Print array values (-1 for all)            (  0) <unset>
 - UT_SHOW_TIMING       Show timing table                          (  1) <unset>
 - UT_INTERACTIVE       Run in interactive mode                    (  0) <unset>
================================================================================
Note: Google Test filter = FlowExport.CommExport
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FlowExport
[ RUN      ] FlowExport.CommExport
[ INFO     ] Detected 0 GPUs
[       OK ] FlowExport.CommExport (6 ms)
[----------] 1 test from FlowExport (7 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (7 ms total)
[  PASSED  ] 1 test.
[ INFO     ] Total executed cases: 0
[ TIMING   ] TEST SUITE          : TEST NAME           :       TIME ms (STATUS)
[ TIMING   ] FlowExport          : CommExport          :       0.01 sec (PASS)
[ TIMING   ] FlowExport          : TOTAL               :       0.01 sec (PASS)
[ TIMING   ] Total time:       0.00 minutes

$ UT_MAX_GPUS=1 UT_VERBOSE=1 ./build/release/test/rccl-UnitTests --gtest_filter=FlowExport.CommExport
================================================================================
 Environment variables:
 - UT_SHOW_NAMES        Show test case names                       (  1) <unset>
 - UT_MIN_GPUS          Minimum number of GPUs to use              (  2) <unset>
 - UT_MAX_GPUS          Maximum number of GPUs to use              (  1) 1
 - UT_POW2_GPUS         Only allow power-of-2 # of GPUs            (  0) <unset>
 - UT_PROCESS_MASK      Whether to run single/multi process        (  3) <unset>
 - UT_VERBOSE           Show verbose unit test output              (  1) 1
 - UT_REDOPS            List of reduction ops to test              ( -1) <unset>
 - UT_DATATYPES         List of datatypes to test                  ( -1) <unset>
 - UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU (  1) <unset>
 - UT_PRINT_VALUES      Print array values (-1 for all)            (  0) <unset>
 - UT_SHOW_TIMING       Show timing table                          (  1) <unset>
 - UT_INTERACTIVE       Run in interactive mode                    (  0) <unset>
================================================================================
Note: Google Test filter = FlowExport.CommExport
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FlowExport
[ RUN      ] FlowExport.CommExport
[ INFO     ] Detected 1 GPUs
[       OK ] FlowExport.CommExport (7 ms)
[----------] 1 test from FlowExport (7 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (7 ms total)
[  PASSED  ] 1 test.
[ INFO     ] Total executed cases: 0
[ TIMING   ] TEST SUITE          : TEST NAME           :       TIME ms (STATUS)
[ TIMING   ] FlowExport          : CommExport          :       0.01 sec (PASS)
[ TIMING   ] FlowExport          : TOTAL               :       0.01 sec (PASS)
[ TIMING   ] Total time:       0.00 minutes

I do have access to a couple of machines with 1 GPU each (as shown above). Is there a way to get the unit-tests to use more than one machine (e.g., using a hostfile a la MPI)?

sreeram-arista • Aug 23 '23 15:08

I don't think so. @wenkaidu - any comments?

nusislam • Aug 24 '23 18:08

I see many test failures now that you have added the new commits.

nusislam • Aug 24 '23 18:08

I don't have access to the test logs. Could you retrieve a couple and send them to me?

sreeram-arista • Aug 24 '23 18:08

I see the following in the log:

[ RUN ] FlowExport.CommExport
06106a6eaa4c:51110:51110 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
06106a6eaa4c:51110:51110 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 0 : librccl-net.so: cannot open shared object file: No such file or directory
06106a6eaa4c:51110:51110 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
06106a6eaa4c:51110:51110 [0] NCCL INFO flow export disabled
06106a6eaa4c:51110:51110 [0] NCCL INFO Kernel version: 5.9.1-amdsos-build32-1+
RCCL version 2.18.3+hip5.7 HEAD:8eabd2a
[ ERROR ] Child 0 pipe closed unexpectedly

nusislam • Aug 24 '23 18:08

Thanks. I've pushed a commit that I think will fix it. Let's see what the CI shows.

sreeram-arista • Aug 24 '23 18:08

@sreeram-arista Is this something you are still working on? It is still tagged as WIP and there has been no movement in a while.

akolliasAMD • Aug 12 '24 16:08

I've discarded this PR. I haven't worked on it since it was posted, and I don't think I'll be able to follow up on it in the foreseeable future.

sreeram-arista • Aug 12 '24 16:08