Mingbang Wang
@Rhett-Ying The benchmark shows that the variation in performance is acceptable. I'm trying to find a way for all replicas to obtain a random seed from the main process...
Benchmark on `/dgl/examples/multigpu/graphbolt/node_classification.py`:

## ogbn-products

Old:

```txt
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:02, 16.06it/s]
Validating...
10it [00:00, 21.67it/s]
Epoch...
```
Tested on g4dn.metal.
@Rhett-Ying The random seed issue has been resolved. What a relief that torch.distributed provides convenient communication APIs.
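For reference, the seed-synchronization pattern can be sketched with `torch.distributed`'s object collectives. This is a minimal single-process illustration (the variable names and the single-rank `gloo` setup are mine for demonstration, not taken from the PR); a real multi-GPU run would launch one rank per GPU:

```python
import os
import random
import torch.distributed as dist

# Single-process "gloo" group purely for illustration; in the actual
# multi-GPU script the process group is initialized with one rank per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Rank 0 draws the seed; every other rank receives the same value via
# broadcast, so all replicas can shuffle/sample consistently.
obj = [random.randint(0, 2**31 - 1)] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(obj, src=0)
shared_seed = obj[0]

dist.destroy_process_group()
```

`broadcast_object_list` pickles arbitrary Python objects, so it is convenient for small control-plane values like a seed; for tensors already on device, a plain `dist.broadcast` avoids the pickling overhead.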
> This POC proves to work well both on correctness and performance. Now it's time to finalize the code change. > > 1. Is it possible to update existing `ItemSampler`...
Could you please paste the error message here?
> The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization....
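To compare the two implementations beyond wall-clock time, one rough stdlib-only way to estimate CPU utilization is the ratio of process CPU time to wall time around a workload (the helper name `cpu_utilization` is mine; production benchmarking would more likely sample system-wide usage with a tool like `psutil` or `top`):

```python
import time

def cpu_utilization(fn):
    """Rough estimate: process CPU time divided by elapsed wall time.

    Values near 1.0 mean one core fully busy; values above 1.0 indicate
    multiple threads doing CPU work concurrently.
    """
    t0_wall = time.perf_counter()
    t0_cpu = time.process_time()
    fn()
    wall = time.perf_counter() - t0_wall
    cpu = time.process_time() - t0_cpu
    return cpu / wall if wall > 0 else 0.0

# Example: a CPU-bound loop standing in for one sampling epoch.
util = cpu_utilization(lambda: sum(i * i for i in range(10**6)))
```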
@mfbalin Thank you for your valuable suggestions, but considering the scope of this PR, I'd like to defer them to another PR. @Rhett-Ying If everything looks good to you, please...
Closing because the changes were made in #7394 #7408 #7424 #7430.
According to https://www.dgl.ai/pages/start.html, the currently (2024/05/11) supported Python versions are `3.8, 3.9, 3.10, 3.11, 3.12`. If one day Python 3.8 is no longer supported, we need to remove all imports of `typing`...
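The reason the `typing` imports are tied to Python 3.8 support: PEP 585 (Python 3.9+) lets the builtin containers be used directly as generic annotations, so `typing.List`/`typing.Dict` become unnecessary. A small sketch (the `degree_map` helper is hypothetical, just to show the annotation styles):

```python
# Python 3.8 compatible: parameterized generics come from typing.
from typing import Dict, List

def degree_map(edges: List[int]) -> Dict[int, int]:
    # Hypothetical helper: count occurrences of each node id.
    counts: Dict[int, int] = {}
    for v in edges:
        counts[v] = counts.get(v, 0) + 1
    return counts

# Once 3.8 support is dropped, the import can go away (PEP 585):
# def degree_map(edges: list[int]) -> dict[int, int]: ...
```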