
Training does not converge on the ogbn-arxiv dataset.

JiaLonghao1997 opened this issue on May 28, 2024 · 1 comment

When running the teacher training on ogbn-arxiv, we encountered the following CUDA out-of-memory error.

20240528-09:53:39: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-09:53:39: Total 169343 nodes.
20240528-09:53:39: Total 2501829 edges.
20240528-09:53:39: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-09:53:39: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 32768, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 512, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
Traceback (most recent call last):
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 244, in <module>
    main()
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 227, in main
    score = run(args)
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 132, in run
    out, score_val, score_test, h_list, dist, codebook = run_transductive(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 295, in run_transductive
    out, loss_train, score_train,  _, _, _ = evaluate(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 148, in evaluate
    h_list, logits, _ , dist, codebook = model.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 519, in inference
    return self.encoder.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 243, in inference
    dist_all = torch.zeros(feats.shape[0], self.codebook_size, device=device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.67 GiB (GPU 0; 10.91 GiB total capacity; 101.55 MiB already allocated; 10.13 GiB free; 104.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
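
The failing line allocates a dense [num_nodes, codebook_size] float32 matrix: 169,343 × 32,768 × 4 bytes ≈ 20.67 GiB, which matches the reported allocation and cannot fit on a 10.91 GiB GPU. If the downstream code only needs each node's nearest code (and its distance), computing distances in chunks would avoid materializing the full matrix. A minimal sketch of that idea follows; the helper name, signature, and chunk size are hypothetical, not the repo's API:

```python
import torch

@torch.no_grad()
def nearest_codes(feats: torch.Tensor, codebook: torch.Tensor,
                  chunk_size: int = 4096):
    """Hypothetical chunked alternative to allocating the full
    [num_nodes, codebook_size] distance matrix at once.

    feats:    [N, D] node embeddings
    codebook: [K, D] code vectors
    Returns per-node nearest code indices and distances (both [N]).
    """
    idx_parts, dist_parts = [], []
    for chunk in feats.split(chunk_size):     # [<=chunk_size, D]
        d = torch.cdist(chunk, codebook)      # [<=chunk_size, K] only
        min_d, min_i = d.min(dim=1)
        idx_parts.append(min_i)
        dist_parts.append(min_d)
    return torch.cat(idx_parts), torch.cat(dist_parts)
```

With chunk_size=4096 and codebook_size=32768, the transient per-chunk matrix is only ~0.5 GiB instead of ~20.67 GiB.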

To avoid the CUDA out-of-memory error, we set batch_size=64, codebook_size=8192, num_layers=2, and fan_out=5,10. Training then ran, but it did not converge: the loss grows monotonically across epochs while the validation accuracy stays roughly flat.
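
The sizing arithmetic below (a quick sanity check using the numbers from the logs) shows why the reduced codebook gets past the allocation even though training then diverges:

```python
# Quick sanity check of the dense [num_nodes, codebook_size] distance
# matrix in float32 (4 bytes per element), using the sizes from the logs:
num_nodes = 169_343
for codebook_size in (32_768, 8_192):
    gib = num_nodes * codebook_size * 4 / 2**30
    print(f"codebook_size={codebook_size}: {gib:.2f} GiB")
# codebook_size=32768: 20.67 GiB -> exceeds the 10.91 GiB GPU (OOM above)
# codebook_size=8192:   5.17 GiB -> fits, so training starts (but diverges)
```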

20240528-10:00:35: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-10:00:35: Total 169343 nodes.
20240528-10:00:35: Total 2501829 edges.
20240528-10:00:35: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-10:00:35: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 8192, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 64, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
20240528-10:01:21: out.size(): torch.Size([169343, 40])
20240528-10:01:21: Ep   1 | max_memory_allocated: 8.4489Gb | loss: 2.4847 | s_train: 0.3918 | s_val: 0.4154 | s_test: 0.3865
20240528-10:02:03: out.size(): torch.Size([169343, 40])
20240528-10:02:03: Ep   2 | max_memory_allocated: 8.4784Gb | loss: 5.3593 | s_train: 0.3868 | s_val: 0.4248 | s_test: 0.4366
20240528-10:02:44: out.size(): torch.Size([169343, 40])
20240528-10:02:44: Ep   3 | max_memory_allocated: 8.4784Gb | loss: 9.8887 | s_train: 0.4054 | s_val: 0.4238 | s_test: 0.4109
20240528-10:03:26: out.size(): torch.Size([169343, 40])
20240528-10:03:26: Ep   4 | max_memory_allocated: 8.4784Gb | loss: 14.7743 | s_train: 0.4291 | s_val: 0.4472 | s_test: 0.4399
20240528-10:04:08: out.size(): torch.Size([169343, 40])
20240528-10:04:08: Ep   5 | max_memory_allocated: 8.4788Gb | loss: 19.6258 | s_train: 0.4261 | s_val: 0.4569 | s_test: 0.4425
20240528-10:04:50: out.size(): torch.Size([169343, 40])
20240528-10:04:50: Ep   6 | max_memory_allocated: 8.4788Gb | loss: 24.9095 | s_train: 0.4253 | s_val: 0.4276 | s_test: 0.4159
20240528-10:05:32: out.size(): torch.Size([169343, 40])
20240528-10:05:32: Ep   7 | max_memory_allocated: 8.4788Gb | loss: 30.3602 | s_train: 0.4224 | s_val: 0.4353 | s_test: 0.4223
20240528-10:06:13: out.size(): torch.Size([169343, 40])
20240528-10:06:13: Ep   8 | max_memory_allocated: 8.4788Gb | loss: 35.7189 | s_train: 0.4145 | s_val: 0.4387 | s_test: 0.4437
20240528-10:06:55: out.size(): torch.Size([169343, 40])
20240528-10:06:55: Ep   9 | max_memory_allocated: 8.4788Gb | loss: 41.0788 | s_train: 0.4183 | s_val: 0.4456 | s_test: 0.4395
20240528-10:07:37: out.size(): torch.Size([169343, 40])
20240528-10:07:37: Ep  10 | max_memory_allocated: 8.4788Gb | loss: 46.3140 | s_train: 0.4137 | s_val: 0.4240 | s_test: 0.4146
20240528-10:08:18: out.size(): torch.Size([169343, 40])
20240528-10:08:18: Ep  11 | max_memory_allocated: 8.4788Gb | loss: 51.6967 | s_train: 0.3891 | s_val: 0.3980 | s_test: 0.3784
20240528-10:09:00: out.size(): torch.Size([169343, 40])
20240528-10:09:00: Ep  12 | max_memory_allocated: 8.4788Gb | loss: 57.0279 | s_train: 0.4157 | s_val: 0.4148 | s_test: 0.4079
20240528-10:09:42: out.size(): torch.Size([169343, 40])
20240528-10:09:42: Ep  13 | max_memory_allocated: 8.4788Gb | loss: 62.2806 | s_train: 0.4147 | s_val: 0.4572 | s_test: 0.4607
20240528-10:10:24: out.size(): torch.Size([169343, 40])
20240528-10:10:24: Ep  14 | max_memory_allocated: 8.4788Gb | loss: 67.5761 | s_train: 0.4291 | s_val: 0.4422 | s_test: 0.4360
20240528-10:11:06: out.size(): torch.Size([169343, 40])
20240528-10:11:06: Ep  15 | max_memory_allocated: 8.4788Gb | loss: 73.1767 | s_train: 0.4107 | s_val: 0.4147 | s_test: 0.3931
20240528-10:11:48: out.size(): torch.Size([169343, 40])
20240528-10:11:48: Ep  16 | max_memory_allocated: 8.4788Gb | loss: 79.3345 | s_train: 0.4260 | s_val: 0.4328 | s_test: 0.4248
20240528-10:12:30: out.size(): torch.Size([169343, 40])
20240528-10:12:30: Ep  17 | max_memory_allocated: 8.4788Gb | loss: 86.1251 | s_train: 0.4152 | s_val: 0.4019 | s_test: 0.4046
20240528-10:13:12: out.size(): torch.Size([169343, 40])
20240528-10:13:12: Ep  18 | max_memory_allocated: 8.4788Gb | loss: 92.6365 | s_train: 0.4112 | s_val: 0.4315 | s_test: 0.4274
20240528-10:13:54: out.size(): torch.Size([169343, 40])
20240528-10:13:54: Ep  19 | max_memory_allocated: 8.4788Gb | loss: 99.6484 | s_train: 0.4001 | s_val: 0.3916 | s_test: 0.3596
20240528-10:14:36: out.size(): torch.Size([169343, 40])
20240528-10:14:36: Ep  20 | max_memory_allocated: 8.4788Gb | loss: 106.8252 | s_train: 0.3850 | s_val: 0.3665 | s_test: 0.3562
20240528-10:15:18: out.size(): torch.Size([169343, 40])
20240528-10:15:18: Ep  21 | max_memory_allocated: 8.4788Gb | loss: 115.3929 | s_train: 0.3980 | s_val: 0.3825 | s_test: 0.3524
20240528-10:16:00: out.size(): torch.Size([169343, 40])
20240528-10:16:00: Ep  22 | max_memory_allocated: 8.4788Gb | loss: 124.5834 | s_train: 0.4036 | s_val: 0.3981 | s_test: 0.4074
20240528-10:16:42: out.size(): torch.Size([169343, 40])
20240528-10:16:42: Ep  23 | max_memory_allocated: 8.4788Gb | loss: 135.1810 | s_train: 0.4047 | s_val: 0.4172 | s_test: 0.4079
20240528-10:17:24: out.size(): torch.Size([169343, 40])
20240528-10:17:24: Ep  24 | max_memory_allocated: 8.4788Gb | loss: 147.3654 | s_train: 0.4042 | s_val: 0.4236 | s_test: 0.4325
20240528-10:18:05: out.size(): torch.Size([169343, 40])
20240528-10:18:05: Ep  25 | max_memory_allocated: 8.4788Gb | loss: 160.5010 | s_train: 0.4014 | s_val: 0.4227 | s_test: 0.3989
20240528-10:18:47: out.size(): torch.Size([169343, 40])
20240528-10:18:47: Ep  26 | max_memory_allocated: 8.4788Gb | loss: 175.0141 | s_train: 0.3878 | s_val: 0.3512 | s_test: 0.3322
20240528-10:19:30: out.size(): torch.Size([169343, 40])
20240528-10:19:30: Ep  27 | max_memory_allocated: 8.4788Gb | loss: 192.3251 | s_train: 0.3712 | s_val: 0.4209 | s_test: 0.4286
20240528-10:20:11: out.size(): torch.Size([169343, 40])
20240528-10:20:11: Ep  28 | max_memory_allocated: 8.4788Gb | loss: 212.6606 | s_train: 0.3490 | s_val: 0.3544 | s_test: 0.3758
20240528-10:20:54: out.size(): torch.Size([169343, 40])
20240528-10:20:54: Ep  29 | max_memory_allocated: 8.4788Gb | loss: 237.1151 | s_train: 0.3840 | s_val: 0.3876 | s_test: 0.3846
20240528-10:21:36: out.size(): torch.Size([169343, 40])
20240528-10:21:36: Ep  30 | max_memory_allocated: 8.4788Gb | loss: 265.4982 | s_train: 0.3654 | s_val: 0.3808 | s_test: 0.3745
20240528-10:22:18: out.size(): torch.Size([169343, 40])
20240528-10:22:18: Ep  31 | max_memory_allocated: 8.4788Gb | loss: 299.6145 | s_train: 0.3932 | s_val: 0.4149 | s_test: 0.4051
20240528-10:23:00: out.size(): torch.Size([169343, 40])
20240528-10:23:00: Ep  32 | max_memory_allocated: 8.4788Gb | loss: 343.6816 | s_train: 0.3656 | s_val: 0.3318 | s_test: 0.3026
20240528-10:23:42: out.size(): torch.Size([169343, 40])
20240528-10:23:42: Ep  33 | max_memory_allocated: 8.4788Gb | loss: 396.8048 | s_train: 0.3733 | s_val: 0.3808 | s_test: 0.3688
20240528-10:24:23: out.size(): torch.Size([169343, 40])
20240528-10:24:23: Ep  34 | max_memory_allocated: 8.4788Gb | loss: 461.6037 | s_train: 0.3869 | s_val: 0.4067 | s_test: 0.3970
20240528-10:25:05: out.size(): torch.Size([169343, 40])
20240528-10:25:05: Ep  35 | max_memory_allocated: 8.4788Gb | loss: 541.3137 | s_train: 0.3899 | s_val: 0.4165 | s_test: 0.4132
20240528-10:25:47: out.size(): torch.Size([169343, 40])
20240528-10:25:47: Ep  36 | max_memory_allocated: 8.4788Gb | loss: 642.1798 | s_train: 0.4000 | s_val: 0.4126 | s_test: 0.3850
20240528-10:26:29: out.size(): torch.Size([169343, 40])
20240528-10:26:29: Ep  37 | max_memory_allocated: 8.4788Gb | loss: 762.5302 | s_train: 0.4127 | s_val: 0.4166 | s_test: 0.3943
20240528-10:27:11: out.size(): torch.Size([169343, 40])
20240528-10:27:11: Ep  38 | max_memory_allocated: 8.4788Gb | loss: 897.7273 | s_train: 0.3879 | s_val: 0.3960 | s_test: 0.3696
20240528-10:27:53: out.size(): torch.Size([169343, 40])
20240528-10:27:53: Ep  39 | max_memory_allocated: 8.4788Gb | loss: 1049.9378 | s_train: 0.3979 | s_val: 0.4002 | s_test: 0.3817
20240528-10:28:36: out.size(): torch.Size([169343, 40])
20240528-10:28:36: Ep  40 | max_memory_allocated: 8.4788Gb | loss: 1211.5399 | s_train: 0.4186 | s_val: 0.4271 | s_test: 0.4065
20240528-10:29:18: out.size(): torch.Size([169343, 40])
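Parsing the log above makes the divergence explicit: the loss climbs roughly monotonically from 2.48 at epoch 1 to 1211.54 at epoch 40, while s_val only oscillates in the 0.33-0.46 band. A small hypothetical helper (not part of the repo) to pull those columns out of a pasted log:

```python
import re

# Hypothetical log summarizer: extracts epoch, loss, and validation score
# from lines like
#   "... Ep   2 | max_memory_allocated: 8.4784Gb | loss: 5.3593 | ... s_val: 0.4248 ..."
PAT = re.compile(r"Ep\s+(\d+).*?loss: ([\d.]+).*?s_val: ([\d.]+)")

def summarize(log_text: str) -> None:
    for ep, loss, val in PAT.findall(log_text):
        print(f"epoch {int(ep):3d}: loss={float(loss):9.4f}  s_val={float(val):.4f}")
```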

JiaLonghao1997 · May 28, 2024

I am having the same issue with graph tokenization training on ogbn-arxiv: the model does not converge.

hxu105 · Dec 31, 2024