
[Example] Add multi-GPU graph prediction GIN+virtualnode example

peizhou001 opened this pull request 3 years ago • 9 comments

Description

Add a new multi-GPU graph prediction example with a GIN + virtual node model and an OGB dataset.
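For context, here is a minimal sketch of the GIN + virtual node idea (illustrative code, not the implementation in this PR; the class and argument names are assumptions): a learnable virtual-node embedding, shared across graphs, is added to every node's features before each GIN layer and is itself refreshed from a sum-pooled readout between layers, acting as a global scratchpad that lets distant nodes exchange information in one hop.

```python
# Illustrative sketch of GIN + virtual node; not the code in this PR.
import torch.nn as nn
import dgl
from dgl.nn import GINConv, SumPooling

class GINVirtualNode(nn.Module):
    def __init__(self, in_feats, hidden_feats, num_layers=5):
        super().__init__()
        self.proj = nn.Linear(in_feats, hidden_feats)
        # One learnable embedding shared by the virtual node of every graph.
        self.virtual_emb = nn.Embedding(1, hidden_feats)
        nn.init.zeros_(self.virtual_emb.weight)
        mlp = lambda: nn.Sequential(
            nn.Linear(hidden_feats, hidden_feats), nn.ReLU(),
            nn.Linear(hidden_feats, hidden_feats))
        self.layers = nn.ModuleList(
            [GINConv(mlp(), "sum") for _ in range(num_layers)])
        self.vn_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden_feats, hidden_feats), nn.ReLU())
             for _ in range(num_layers - 1)])
        self.pool = SumPooling()

    def forward(self, g, x):
        h = self.proj(x)
        # One virtual-node state per graph in the batch.
        vn = self.virtual_emb.weight[0].expand(g.batch_size, -1)
        for i, layer in enumerate(self.layers):
            # Broadcast each graph's virtual node to all of its nodes.
            h = layer(g, h + dgl.broadcast_nodes(g, vn))
            if i < len(self.layers) - 1:
                # Refresh the virtual node from a sum readout of its graph.
                vn = self.vn_mlps[i](vn + self.pool(g, h))
        return self.pool(g, h)  # graph-level representation for prediction
```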

Checklist

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • [ ] Changes are complete (i.e. I finished coding on this PR)
  • [ ] All changes have test coverage
  • [x] Code is well-documented
  • [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [ ] Related issue is referred in this PR
  • [ ] If the PR is for a new model/paper, I've updated the example index here.

Changes

peizhou001 avatar Aug 11 '22 03:08 peizhou001

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar Aug 11 '22 03:08 dgl-bot

Commit ID: 53cfcd1cd23c7263f2b5beae82db057afeb36e67

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 11 '22 04:08 dgl-bot

Apart from the comments above, how is the scalability of the code w.r.t. the number of GPUs in use?

jermainewang avatar Aug 11 '22 08:08 jermainewang

Should this go to examples/pytorch/gin or examples/pytorch/ogb?

mufeili avatar Aug 12 '22 05:08 mufeili

Should this go to examples/pytorch/gin or examples/pytorch/ogb?

It depends on the purpose of this example:

  • Is this for teaching users how to write GIN? Yes and no, but mostly no IMO, because (1) there is already a script for that and (2) this is a variant of the original GIN model from the paper.
  • Is this for reproducing the OGB leaderboard? No. The performance is not carefully tuned.
  • Is this for teaching users how to write multi-GPU graph prediction? Yes, though we picked a model based on one of the OGB baselines, GIN + virtual node.

Therefore, I think putting the example in either examples/pytorch/gin or examples/pytorch/ogb would make it hard for users to find. I think we should probably create a folder examples/pytorch/multigpu/ and put the example there. We may also want to put the multi-GPU GraphSAGE example there too. Similarly, distributed training should have its own folder.

cc @chang-l @BarclayII @TristonC for opinions.

jermainewang avatar Aug 14 '22 06:08 jermainewang


Apart from the comments above, how is the scalability of the code w.r.t. the number of GPUs in use?

Scaling is good with up to 4 GPUs, but the epoch time stops decreasing and even increases slightly when going from 4 to 8 GPUs; I'm trying to understand why. The data below is averaged over dozens of runs:

ogbg-molhiv

| GPU number | Acceleration ratio |
|---|---|
| 1 | x |
| 2 | 2.1x |
| 4 | 3.2x |
| 8 | 3.1x |

ogbg-molpcba

| GPU number | Acceleration ratio |
|---|---|
| 1 | x |
| 2 | 2.2x |
| 4 | 3.5x |
| 8 | 3.3x |

As we discussed, the poor scaling beyond 4 GPUs may be caused by a too-small per-GPU batch size. Below are new test results in which each GPU is fed 32 samples in every batch, which confirms our hypothesis:

| GPU number | Speedup | Batch size | Test accuracy | Average epoch time (seconds) |
|---|---|---|---|---|
| 1 | x | 32 | 0.7765 | 45.0 |
| 2 | 3.7x | 64 | 0.7761 | 12.1 |
| 4 | 5.9x | 128 | 0.7854 | 7.6 |
| 8 | 9.5x | 256 | 0.7751 | 4.7 |
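(The Speedup column is consistent with the epoch times, e.g. 45.0 s / 12.1 s ≈ 3.7x.) As a hedged sketch of this weak-scaling setup (the training-loop structure, make_model, and the "feat" feature key are illustrative assumptions, not necessarily what this PR's script does): each process keeps a fixed per-GPU batch of 32, and use_ddp=True on DGL's GraphDataLoader shards the dataset so each rank sees a disjoint subset.

```python
# Illustrative weak-scaling sketch; not the exact script in this PR.
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel
from dgl.dataloading import GraphDataLoader

PER_GPU_BATCH = 32  # fixed per GPU, so the global batch is 32 * world_size

def train(rank, world_size, dataset, make_model, num_epochs=100):
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DistributedDataParallel(make_model().cuda(rank), device_ids=[rank])
    # use_ddp=True gives each rank a disjoint shard of the dataset.
    loader = GraphDataLoader(dataset, batch_size=PER_GPU_BATCH,
                             shuffle=True, use_ddp=True)
    opt = torch.optim.Adam(model.parameters())
    for epoch in range(num_epochs):
        loader.set_epoch(epoch)  # reshuffle the shards every epoch
        for g, labels in loader:
            g, labels = g.to(rank), labels.to(rank)
            logits = model(g, g.ndata["feat"])  # "feat" is an assumed key
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Each worker would typically be launched with torch.multiprocessing.spawn(train, args=(world_size, dataset, make_model), nprocs=world_size), so going from 1 to 8 GPUs multiplies the global batch by 8 while the per-GPU work stays constant.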

peizhou001 avatar Aug 15 '22 02:08 peizhou001

Commit ID: 3d7cf5cdfeacc929d848486ce4fd6a9fa4cf50ae

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 15 '22 03:08 dgl-bot

Commit ID: c68aaa8cb906af9800fb868b275e24e8b1d6f1d7

Build ID: 3

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 15 '22 04:08 dgl-bot

Therefore, I think putting the example in either examples/pytorch/gin or examples/pytorch/ogb would make it hard for users to find. I think we should probably create a folder examples/pytorch/multigpu/ and put the example there. We may also want to put the multi-GPU GraphSAGE example there too. Similarly, distributed training should have its own folder.

I agree. Additionally, if it's for educational purposes, I think the multigpu folder should eventually sit under the baseline examples folder, similar to distributed or other advanced folders.

chang-l avatar Aug 15 '22 17:08 chang-l

dgl-bot CI reports for the remaining builds (all succeeded):

| Build ID | Commit ID | Status | Date |
|---|---|---|---|
| 4 | 795ef5d93b7bbe52e0351b1f6aef68efce7bb883 | ✅ CI test succeeded | Aug 16 '22 06:08 |
| 5 | 5d9039366e76a00433957126933268b75ac28ae5 | ✅ CI test succeeded | Aug 16 '22 07:08 |
| 6 | 942994062b6a6c465021a1682d6420837d3107ed | ✅ CI test succeeded | Aug 16 '22 08:08 |
| 7 | a1d0857df4e22d048e6b4032ecc2f0bfbacf36aa | ✅ CI test succeeded | Aug 18 '22 05:08 |
| 8 | 1acecc265035a863d5e1797c7e5e95c1b0130cc1 | ✅ CI test succeeded | Aug 18 '22 10:08 |
| 9 | 0c90d3790c2f5fb951a3a26d5cb1b572d7adc138 | ✅ CI test succeeded | Aug 18 '22 11:08 |
| 10 | a2e3c21d23d2c235f6219d71611d5471f9b2f224 | ✅ CI test succeeded | Aug 18 '22 15:08 |
| 11 | c4a2646dce59354d7abd2465b9303e8b3ab83e43 | ✅ CI test succeeded | Aug 18 '22 16:08 |