GraphMAE2
torch.cuda.OutOfMemoryError: CUDA out of memory
We noticed that the paper uses large datasets such as ogbn-Arxiv and ogbn-Papers100M, but when we ran the code on our own server, we ran out of memory. Roughly how much memory is consumed when training on ogbn-Arxiv? Do you have any suggestions for running GraphMAE2, or for handling large-scale graphs with millions of nodes?
(graphmb) [jialh@gpu07 GraphMAE2]$ sh 01run_ogbn-arxiv.sh
2024-04-22 21:08:20,721 - INFO - ----- Using best configs from configs/ogbn-arxiv.yaml -----
Namespace(seeds=[0], dataset='ogbn-arxiv', device=0, max_epoch=60, warmup_steps=-1, num_heads=8, num_out_heads=1, num_layers=4, num_dec_layers=1, num_remasking=3, num_hidden=1024, residual=True, in_drop=0.2, attn_drop=0.1, norm='layernorm', lr=0.0025, weight_decay=0.06, negative_slope=0.2, activation='prelu', mask_rate=0.5, remask_rate=0.5, remask_method='random', mask_type='mask', mask_method='random', drop_edge_rate=0.5, drop_edge_rate_f=0.0, encoder='gat', decoder='gat', loss_fn='sce', alpha_l=6, optimizer='adamw', max_epoch_f=1000, lr_f=0.005, weight_decay_f=0.0001, linear_prob=True, no_pretrain=False, load_model=False, checkpoint_path=None, use_cfg=True, logging=False, scheduler=True, batch_size=512, batch_size_f=256, sampling_method='lc', label_rate=1.0, ego_graph_file_path='./lc_ego_graphs/ogbn-arxiv-lc-ego-graphs-256.pt', data_dir='./dataset', lam=10.0, full_graph_forward=False, delayed_ema_epoch=40, replace_rate=0.0, momentum=0.996)
2024-04-22 21:08:21,362 - INFO - Before loading data, occupied memory: 353.75 MB
2024-04-22 21:08:21,362 - INFO - ego_graph_file_path: ./lc_ego_graphs/ogbn-arxiv-lc-ego-graphs-256.pt
2024-04-22 21:08:21,678 - INFO - --- to undirected graph ---
2024-04-22 21:08:22,452 - INFO - ### scaling features ###
2024-04-22 21:08:25,297 - INFO - After loading data, occupied memory: 968.62 MB
=== Use sce_loss and alpha_l=6 ===
num_encoder_params: 3428356, num_decoder_params: 131456, num_params_in_total: 6184710
2024-04-22 21:08:25,420 - INFO - ---- start pretraining ----
2024-04-22 21:08:25,420 - INFO - start training..
2024-04-22 21:08:26,714 - INFO - After creating dataloader: Memory: 1856.79 MB
2024-04-22 21:08:26,715 - INFO - Use scheduler
0%| | 0/331 [00:12<?, ?it/s]
Traceback (most recent call last):
File "/public/home/jialh/metaHiC/models/GraphMAE2/main_large.py", line 222, in <module>
model = pretrain(model, feats, graph, pretrain_ego_graph_nodes, max_epoch=max_epoch,
File "/public/home/jialh/metaHiC/models/GraphMAE2/main_large.py", line 120, in pretrain
loss = model(batch_g, x, targets, epoch, drop_g1, drop_g2)
File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/public/home/jialh/metaHiC/models/GraphMAE2/models/edcoder.py", line 231, in forward
loss = self.mask_attr_prediction(g, x, targets, epoch, drop_g1, drop_g2)
File "/public/home/jialh/metaHiC/models/GraphMAE2/models/edcoder.py", line 243, in mask_attr_prediction
latent_target = self.encoder_ema(drop_g2, x,)
File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/public/home/jialh/metaHiC/models/GraphMAE2/models/gat.py", line 76, in forward
h = self.gat_layers[l](g, h)
File "/home1/jialh/tools/anaconda3/envs/mamba/envs/graphmb/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/public/home/jialh/metaHiC/models/GraphMAE2/models/gat.py", line 282, in forward
rst = rst + self.bias.view(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 10.91 GiB total capacity; 9.88 GiB already allocated; 52.06 MiB free; 10.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Thanks for your interest.
The results in the paper were obtained on an 80 GB A100, but most experiments should run fine with 24 GB of GPU memory. I'm not sure whether 11 GB is enough; you could try lowering the batch size or shrinking the model.
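As a rough sketch of what that could look like (the flag names --batch_size, --num_hidden and --num_heads are inferred from the argparse Namespace printed above, so please check them against main_large.py; also note that --use_cfg loads configs/ogbn-arxiv.yaml, which may override command-line values, in which case you would edit that yaml instead):

# Sketch only: flag names inferred from the Namespace dump above; verify in main_large.py.
# As suggested by the OOM message, cap the allocator split size to reduce fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Smaller batch and smaller model than the defaults (batch_size=512, num_hidden=1024, num_heads=8).
python main_large.py \
    --dataset ogbn-arxiv \
    --batch_size 128 \
    --num_hidden 512 \
    --num_heads 4

Lowering batch_size reduces the size of the sampled ego-graph batches, and lowering num_hidden / num_heads shrinks the GAT activations that are blowing up in gat.py; both trade some accuracy for memory.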