[Example] Add multi-GPU graph prediction example with GIN + virtual node
Description
Add a new multi-GPU graph prediction example using a GIN + virtual node model and an OGB dataset.
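For context, the virtual-node trick adds one extra node per graph that aggregates from, and broadcasts back to, all real nodes between GIN layers. A minimal NumPy sketch of that aggregate-and-broadcast step on a batched graph (all names here are illustrative, not the actual code in this PR, and the real model applies an MLP where this sketch uses a plain sum):

```python
import numpy as np

def virtual_node_step(h, batch, vn):
    """One virtual-node update on a batched graph.

    h:     (num_nodes, dim) node features
    batch: (num_nodes,) graph id of each node
    vn:    (num_graphs, dim) current virtual-node embeddings
    """
    num_graphs, dim = vn.shape
    # Aggregate: sum node features into each graph's virtual node.
    agg = np.zeros((num_graphs, dim))
    np.add.at(agg, batch, h)
    vn_new = vn + agg  # the real model passes this through an MLP
    # Broadcast: add each graph's virtual node back to all of its nodes.
    h_new = h + vn_new[batch]
    return h_new, vn_new

# Two tiny graphs: nodes 0-1 belong to graph 0, node 2 to graph 1.
h = np.array([[1.0], [2.0], [3.0]])
batch = np.array([0, 0, 1])
vn = np.zeros((2, 1))
h_new, vn_new = virtual_node_step(h, batch, vn)
# vn_new == [[3.0], [3.0]]; h_new == [[4.0], [5.0], [6.0]]
```

The point of the virtual node is that it gives every pair of nodes in a graph a two-hop communication path, which helps on graph-level prediction tasks.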
Checklist
Please feel free to remove inapplicable items for your PR.
- [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [x] Code is well-documented
- [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [ ] Related issue is referred in this PR
- [ ] If the PR is for a new model/paper, I've updated the example index here.
Changes
To trigger regression tests:
-
`@dgl-bot run [instance-type] [which tests] [compare-with-branch]`; for example: `@dgl-bot run g4dn.4xlarge all dmlc/master` or `@dgl-bot run c5.9xlarge kernel,api dmlc/master`
Commit ID: 53cfcd1cd23c7263f2b5beae82db057afeb36e67
Build ID: 1
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Apart from the comments above, how is the scalability of the code w.r.t. the number of GPUs in use?
Should this go to examples/pytorch/gin or examples/pytorch/ogb?
> Should this go to `examples/pytorch/gin` or `examples/pytorch/ogb`?
Depending on the purpose of this example:
- Is this for teaching users how to write GIN? Yes and no, but more no IMO because (1) there is already a script and (2) it is a variant of the original GIN model in the paper.
- Is this for reproducing OGB leaderboard? No. The performance is not carefully tuned.
- Is this for teaching users how to write multi-GPU graph prediction? Yes, though we picked a model based on one of the OGB baselines, GIN + virtual node.
Therefore, I think putting the example in either examples/pytorch/gin or examples/pytorch/ogb will make it hard for users to find. I think we should probably create a folder examples/pytorch/multigpu/ and put the example there. We may also want to put the multi-GPU GraphSAGE example there. Similarly, distributed training should have its own folder.
cc @chang-l @BarclayII @TristonC for opinions.
> Apart from the comments above, how is the scalability of the code w.r.t. the number of GPUs in use?
Scaling is good with up to 4 GPUs, but the epoch time stops decreasing and even increases slightly when going from 4 to 8 GPUs; I'm trying to understand why that happens. Below are the results averaged over dozens of runs:
ogbg-molhiv
| GPU number | Acceleration ratio |
|---|---|
| 1 | 1x |
| 2 | 2.1x |
| 4 | 3.2x |
| 8 | 3.1x |
ogbg-molpcba
| GPU number | Acceleration ratio |
|---|---|
| 1 | 1x |
| 2 | 2.2x |
| 4 | 3.5x |
| 8 | 3.3x |
As we discussed, the poor scaling beyond 4 GPUs may be caused by a per-GPU batch size that is too small. Below are new results where each GPU is fed 32 samples per batch (so the global batch size grows with the GPU count), which confirms our hypothesis:
| GPU number | Speedup | Global batch size | Test accuracy | Average epoch time (seconds) |
|---|---|---|---|---|
| 1 | 1x | 32 | 0.7765 | 45.0 |
| 2 | 3.7x | 64 | 0.7761 | 12.1 |
| 4 | 5.9x | 128 | 0.7854 | 7.6 |
| 8 | 9.5x | 256 | 0.7751 | 4.7 |
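The speedup column can be sanity-checked directly against the measured epoch times; a small pure-Python check (numbers taken from the table above):

```python
# Recompute speedup and parallel efficiency from the measured epoch times.
epoch_times = {1: 45.0, 2: 12.1, 4: 7.6, 8: 4.7}  # seconds, from the table above

baseline = epoch_times[1]
for gpus, t in sorted(epoch_times.items()):
    speedup = baseline / t
    efficiency = speedup / gpus  # relative to ideal linear scaling
    print(f"{gpus} GPU(s): {speedup:.1f}x speedup, {efficiency:.0%} efficiency")
```

Note the superlinear numbers (e.g. 3.7x on 2 GPUs): since the global batch size grows with the GPU count, each epoch needs fewer iterations, so this setup measures something closer to weak scaling than strong scaling.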
Commit ID: 3d7cf5cdfeacc929d848486ce4fd6a9fa4cf50ae
Build ID: 2
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: c68aaa8cb906af9800fb868b275e24e8b1d6f1d7
Build ID: 3
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
> Therefore, I think putting the example either in examples/pytorch/gin or examples/pytorch/ogb will make it hard for users to find them. I think we probably should create a folder examples/pytorch/multigpu/ and put the example there. We may also want to put the multigpu GraphSAGE example there too. Similarly, distributed training should have its own folder.
I agree. Additionally, if it's for educational purposes, I think the multi-GPU folder should eventually sit under the baseline examples folder, similar to the distributed and other advanced folders.
Commit ID: 795ef5d93b7bbe52e0351b1f6aef68efce7bb883
Build ID: 4
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 5d9039366e76a00433957126933268b75ac28ae5
Build ID: 5
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 942994062b6a6c465021a1682d6420837d3107ed
Build ID: 6
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: a1d0857df4e22d048e6b4032ecc2f0bfbacf36aa
Build ID: 7
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 1acecc265035a863d5e1797c7e5e95c1b0130cc1
Build ID: 8
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 0c90d3790c2f5fb951a3a26d5cb1b572d7adc138
Build ID: 9
Status: ✅ CI test succeeded
Report path: link
Full logs path: link