ACCL

Distributed emulation stuck with >= 12 ranks (2+ nodes)

Open PedrooHR opened this issue 3 years ago • 0 comments

I'm working on the integration of ACCL and OMPC. I'm currently using ACCL's distributed emulation approach to start testing offloading computation to Alveo boards in a distributed system, with ACCL as the communication backend.

I've tried some scenarios:

  • 4 nodes: With 3 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 3 nodes: With 4 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 2 nodes: With 6 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 1 node (not distributed): Tested with up to 20 ACCL instances; this works with no problems.

Any multi-node scenario with 10 or fewer instances in total works fine. (I can't test with 11 instances due to some integration constraints.)
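
As a rough summary of the observations above, here is a minimal sketch that encodes only what was reported in this issue. The function name `observed_outcome` and the `"ok"`/`"hang"` labels are hypothetical, chosen for illustration; the sketch describes the observed pattern and says nothing about the root cause.

```python
from typing import Optional


def observed_outcome(nodes: int, per_node: int) -> Optional[str]:
    """Return the outcome reported in this issue for a given layout.

    "ok"   -> the application ran to completion
    "hang" -> the application got stuck after some time
    None   -> this configuration was not tested
    """
    total = nodes * per_node

    if nodes == 1:
        # Single node: tested with up to 20 instances without problems.
        return "ok" if total <= 20 else None

    # Two or more nodes: the threshold depends on the total rank count,
    # not on how the ranks are split across nodes.
    if total >= 12:
        return "hang"
    if total <= 10:
        return "ok"
    return None  # 11 ranks could not be tested


# The layouts listed above, as (nodes, instances per node).
for nodes, per_node in [(4, 3), (3, 4), (2, 6), (1, 20)]:
    print(nodes, per_node, nodes * per_node, observed_outcome(nodes, per_node))
```

The point of writing it this way is that all three failing layouts reduce to the same total of 12 ranks, which suggests the limit is tied to the total number of ranks across nodes and not to the per-node count.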

PedrooHR · Nov 04 '22 17:11