ACCL
Distributed emulation stuck with >= 12 ranks (2+ nodes)
I'm working on the integration of ACCL and OMPC. I'm currently using ACCL's distributed emulation to start testing offloading computation to Alveo boards in a distributed system, with ACCL as the communication backend.
I've tried some scenarios:
- 4 nodes: With 3 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
- 3 nodes: With 4 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
- 2 nodes: With 6 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
- 1 node (not distributing): Tested up to 20 ACCL instances, this works with no problem
Any scenario with 10 or fewer instances in total works fine (I can't test with 11 instances due to some integration constraints).
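To make the pattern easier to see, here is a small Python sketch that tabulates the configurations listed above. The rule it checks (hang when there are 2 or more nodes and 12 or more ranks in total) is only my reading of these observations, not a confirmed cause, and the function names are just for illustration.

```python
# Configurations from the scenarios above: (nodes, instances per node, observed outcome).
observations = [
    (4, 3, "hangs"),
    (3, 4, "hangs"),
    (2, 6, "hangs"),
    (1, 20, "works"),
]


def total_ranks(nodes, per_node):
    """Total number of emulated ACCL instances across all nodes."""
    return nodes * per_node


def matches_pattern(nodes, per_node, outcome):
    """Check one observation against the assumed rule:
    hang if and only if nodes >= 2 and total ranks >= 12.
    This rule is a hypothesis drawn from the report, not a diagnosis."""
    hangs = nodes >= 2 and total_ranks(nodes, per_node) >= 12
    predicted = "hangs" if hangs else "works"
    return predicted == outcome


for nodes, per_node, outcome in observations:
    ranks = total_ranks(nodes, per_node)
    print(f"{nodes} node(s) x {per_node} per node = {ranks} ranks: {outcome}")
```

Every multi-node case that hangs sits at the same total of 12 ranks, while the single-node run with 20 ranks works, which suggests the total rank count only matters once the ranks span more than one node.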