
How to reproduce accuracy results for IGB-large

Open hirayaku opened this issue 2 years ago • 4 comments

I am unable to reproduce the accuracy results in the paper for the IGB-large + SAGE model. I got 60-61% validation and test accuracy after 3 epochs, compared to the 64.89% reported in the paper.

To Reproduce

I downloaded the dataset with the provided bash script. I ran results/IGB_large/gnn.py instead of the default train_single_gpu.py, since the logs under the results folder align with the results reported in Table 9 of the paper.

I made small changes to gnn.py. The code is attached below, if needed: gnn.py.zip. gnn.py doesn't use the dataloader from the IGB package but directly mmaps the hard-coded dataset files; I changed these paths to the corresponding locations on my system.
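For context, the direct-mmap loading in gnn.py is roughly equivalent to the sketch below. The path and filenames here are placeholders for my local copy, not necessarily the exact names used in gnn.py:

```python
import numpy as np
import torch

# Sketch of loading the raw IGB files directly instead of going through
# the IGB dataloader. Path and filenames are placeholders for my setup.
path = '/mnt/nvme14/IGB260M/large/processed/paper'

# Memory-map the 100M x 1024 float32 feature matrix so it is paged in
# from disk on demand rather than read fully into RAM.
feats = np.load(f'{path}/node_feat.npy', mmap_mode='r')

# The 19-class labels are small enough to load outright.
labels = torch.from_numpy(np.load(f'{path}/node_label_19.npy')).long()
```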

Training logs

Dataset_size: large
Model       : sage
Num_classes : 19

Graph(num_nodes=100000000, num_edges=1323571364,
      ndata_schemes={'feat': Scheme(shape=(1024,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool)}
      edata_schemes={})
Epoch 0: 100%|████████████████████████████████████████████████████| 1832/1832 [11:15<00:00,  2.71it/s, Train Acc=41.88%]
Layer 0: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [11:47<00:00,  4.31it/s]
Layer 1: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [04:37<00:00, 10.99it/s]
Epoch 00000 | Loss 3520.9861 | Train Acc 0.4249 | Val Acc 0.5988 | Test Acc 0.5992 | Time 1666.04s | GPU 10142107.1 MB
Epoch 1: 100%|████████████████████████████████████████████████████| 1832/1832 [11:23<00:00,  2.68it/s, Train Acc=42.75%]
Layer 0: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [10:25<00:00,  4.88it/s]
Layer 1: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [04:43<00:00, 10.75it/s]
Epoch 00001 | Loss 3446.5244 | Train Acc 0.4264 | Val Acc 0.6018 | Test Acc 0.6022 | Time 1598.90s | GPU 10159558.0 MB
Epoch 2: 100%|████████████████████████████████████████████████████| 1832/1832 [11:13<00:00,  2.72it/s, Train Acc=42.06%]
Layer 0: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [10:33<00:00,  4.81it/s]
Layer 1: 100%|██████████████████████████████████████████████████████████████████████| 3052/3052 [04:40<00:00, 10.87it/s]
Epoch 00002 | Loss 3439.4326 | Train Acc 0.4266 | Val Acc 0.6021 | Test Acc 0.6025 | Time 1593.71s | GPU 10159558.0 MB

Total time taken:  4858.714913845062
Train accuracy: 0.43 ± 0.00 	 Best: 42.6640%
Test accuracy: 0.60 ± 0.00 	 Best: 60.2522%

 -------- For debugging --------- 
Parameters:  Namespace(path='/mnt/nvme14/IGB260M/', modelpath='gsage_19.pt', dataset_size='large', num_classes=19, hidden_channels=256, fan_out='5,10', num_layers=2, learning_rate=0.001, decay=0.0001, num_workers=16, batch_size=32768, dropout=0.2, epochs=3, model='sage', in_memory=0, device='0')
Graph(num_nodes=100000000, num_edges=1323571364,
      ndata_schemes={'feat': Scheme(shape=(1024,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'features': Scheme(shape=(1024,), dtype=torch.float32), 'labels': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={})
Train accuracy:  [0.42486560344696045, 0.426390141248703, 0.4266396462917328]
Test accuracy:  [0.5992226600646973, 0.602225661277771, 0.6025221347808838]
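For reference, the neighbor-sampling setup implied by the Namespace above would look roughly like the sketch below in DGL. This is an assumption reconstructed from the printed parameters, not the actual gnn.py code; `graph`, `train_nid`, `model`, and `device` are assumed to be created elsewhere, and older DGL versions name these classes MultiLayerNeighborSampler and NodeDataLoader:

```python
import dgl
import torch

# fan_out='5,10' with num_layers=2: sample 5 neighbors at the first hop
# and 10 at the second; batch_size=32768; num_workers=16 (per the
# Namespace printed above).
sampler = dgl.dataloading.NeighborSampler([5, 10])
train_dataloader = dgl.dataloading.DataLoader(
    graph, train_nid, sampler,
    device=device,
    batch_size=32768,
    shuffle=True,
    drop_last=False,
    num_workers=16,
)

# learning_rate=0.001, decay=0.0001 (the optimizer choice itself is an
# assumption; Adam is used here for illustration).
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```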

Expected behavior

I expect the final accuracy to be similar to the numbers reported in Table 9, and the per-epoch accuracy to be similar to results/IGB_large/results/sage/gsage_19.txt.

hirayaku avatar Nov 19 '23 21:11 hirayaku

The hyperparameters do make a noticeable difference in performance. I would recommend testing with different hyperparameters (batch size, hidden dimension). We ran these with one particular set of parameters, so it is possible to see a ±4-5% swing in performance with hyperparameter tuning.
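For example, a minimal sweep over batch size and hidden dimension could look like the sketch below (flag names are assumed to match the Namespace printed in the logs above):

```python
import itertools
import subprocess

# Sweep batch size and hidden dimension; flag names assumed to match
# the Namespace printed in the training logs above.
for bs, hidden in itertools.product([10240, 32768], [128, 256, 512]):
    subprocess.run(
        ['python', 'gnn.py',
         '--dataset_size', 'large', '--model', 'sage',
         '--batch_size', str(bs),
         '--hidden_channels', str(hidden),
         '--epochs', '3'],
        check=True,
    )
```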

akhatua2 avatar Nov 19 '23 21:11 akhatua2

We have team members within our group, and people externally, who have successfully reproduced the training accuracy with the SAGE model on different setups and hardware. Until we fully determine whether this is an issue on our side, we will keep this issue labeled as a question. Please try the suggestion from @akhatua2 and let us know.

msharmavikram avatar Nov 19 '23 21:11 msharmavikram

May I ask which set of hyperparameters you used to achieve the reported accuracy? Since the dataset is gigantic and training runs are long, it would be easier for me to start from that setting rather than doing a grid search.

hirayaku avatar Nov 19 '23 21:11 hirayaku

Do we generally need to tune hyperparameters other than the hidden dimension and batch size to obtain the results in the paper?

I ran some tests with smaller datasets (SAGE on IGB-small with train_single_gpu.py) but couldn't reach the reported accuracy. I got 73.2-73.6% depending on the hidden dimension (128-1024), while the paper reports 75.49%. I didn't tune the batch size because the paper states the batch size used is 10K.

I also noticed that train_single_gpu.py and the scripts under the results folder do inference differently. Which one did the paper use?
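Judging from the Layer 0 / Layer 1 progress bars in the logs above, the results-folder script appears to do layer-wise full-neighbor inference, roughly like the sketch below, while train_single_gpu.py evaluates with the sampled fan-out. This is an assumption based on the log format; attribute names such as model.layers, model.hidden_channels, and model.num_classes are mine, not the actual code:

```python
import dgl
import torch

@torch.no_grad()
def layerwise_inference(model, graph, feats, device, batch_size=32768):
    # Compute each layer for ALL nodes with full neighborhoods before
    # moving on to the next layer, avoiding fan-out sampling bias at
    # evaluation time. `feats` is assumed to be a torch.FloatTensor of
    # node features; attribute names on `model` are assumptions.
    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    x = feats
    for layer_idx, layer in enumerate(model.layers):
        loader = dgl.dataloading.DataLoader(
            graph, torch.arange(graph.num_nodes()), sampler,
            batch_size=batch_size, shuffle=False, num_workers=0)
        last = layer_idx == len(model.layers) - 1
        y = torch.empty(graph.num_nodes(),
                        model.num_classes if last else model.hidden_channels)
        for input_nodes, output_nodes, blocks in loader:
            h = layer(blocks[0].to(device), x[input_nodes].to(device))
            if not last:
                h = torch.relu(h)
            y[output_nodes] = h.cpu()
        x = y
    return x
```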

@akhatua2 @msharmavikram

hirayaku avatar Nov 20 '23 02:11 hirayaku