Bus error (core dumped) in EdgeDataLoader of unsupervised GraphSAGE example (20M edges)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- In the example code: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/advanced/train_sampling_unsupervised.py
- Feed a dataset with 10M nodes and 20M edges
- A bus error (core dumped) occurs in EdgeDataLoader before training and in DataLoader during inference (see the reproduction sketch below).
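For reference, a minimal sketch of the kind of setup that hits this, assuming a synthetic random graph of roughly the reported size and the `EdgeDataLoader` / `MultiLayerNeighborSampler` API used in the linked example (the actual dataset, fan-outs, and batch size are not from the report):

```python
import torch
import dgl

# Synthetic graph of roughly the reported size (10M nodes, 20M edges).
num_nodes, num_edges = 10_000_000, 20_000_000
src = torch.randint(0, num_nodes, (num_edges,))
dst = torch.randint(0, num_nodes, (num_edges,))
g = dgl.graph((src, dst), num_nodes=num_nodes)

# All edge IDs as training seeds (~20M), num_workers=0 as in the report.
train_seeds = torch.arange(g.num_edges())
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_seeds, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(1),
    batch_size=10000, shuffle=True, drop_last=False, num_workers=0)

# The bus error is reported to happen before the first batch is produced.
for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    break
```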
Expected behavior
The example should run training and inference on this dataset without crashing.
Environment
- DGL Version (e.g., 1.0): 0.8.1cu101
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.8.1+cu101
- OS (e.g., Linux): Docker
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.6.13
- CUDA/cuDNN version (if applicable): CUDA 10.1
- GPU models and configuration (e.g. V100): V100
- Any other relevant information:
Additional context
The number of workers is set to 0 in EdgeDataLoader.
The bug does not appear when the number of train_seeds passed to EdgeDataLoader is small, e.g. 1M.
How many GPUs did you use? Have you changed the args such as --graph-device, --data-device?
> How many GPUs did you use? Have you changed the args such as `--graph-device`, `--data-device`?
One GPU. Nope. I have tried different args but got the same result.
Can you try adding --shm-size=64g (large enough to store the whole graph) to your docker run command?
I cannot change the docker command. I set the number of workers to 0, so I think it doesn't use shared memory.
> Can you try adding `--shm-size=64g` (large enough to store the whole graph) to your `docker run` command?
The docker environment is created automatically. I am not able to change the shm size.
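If the shm size cannot be changed, it may still help to confirm the limit from inside the container; a small Linux-only sketch (not part of the example):

```python
import shutil

# Docker's default /dev/shm is 64 MiB unless --shm-size is passed.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 1024**2:.0f} MiB, free={free / 1024**2:.0f} MiB")
```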
These lines of code in the dataloader create a shared-memory array for shuffling:
https://github.com/dmlc/dgl/blob/5ba5106acab6a642e9b790e5331ee519112a5623/python/dgl/dataloading/dataloader.py#L146-L149
When len(train_seeds) > 8M, the shared tensor exceeds Docker's default shm size of 64MB.
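A back-of-the-envelope check of that threshold, assuming the seed IDs are stored as int64 (8 bytes each):

```python
bytes_per_id = 8                                  # int64 edge IDs
default_shm = 64 * 1024 ** 2                      # Docker's default /dev/shm: 64 MiB

print(default_shm // bytes_per_id)                # 8388608 -> the ~8M threshold above
print(20_000_000 * bytes_per_id / 1024 ** 2)      # ~152.6 MiB needed for 20M seeds
print(1_000_000 * bytes_per_id / 1024 ** 2)       # ~7.6 MiB, which is why 1M seeds work
```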
@jermainewang @BarclayII
Can we avoid that when users do not intend to use shared memory?
According to the comments, the shared tensor is used for persistent_workers=True (or num_workers > 0 I think?). We can change the code to use shared tensors only when these conditions hold. @BarclayII may know more details.
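A minimal sketch of that proposal, illustrative only and not DGL's actual dataloader code (the helper name and the use of torch's `share_memory_()` here are assumptions; the point is just to gate the shared-memory copy on the worker settings):

```python
import torch

def maybe_share_indices(indices: torch.Tensor, num_workers: int,
                        persistent_workers: bool) -> torch.Tensor:
    """Hypothetical helper: keep the seed IDs in shared memory only when
    worker processes actually need to see shuffles done by the main process."""
    if num_workers > 0 or persistent_workers:
        return indices.share_memory_()  # in-place move into shared memory
    return indices                      # num_workers == 0: plain tensor, no /dev/shm allocation
```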
> We can change the code to use shared tensors only when these conditions hold.
That sounds reasonable. I'm not sure how PyTorch shares the tensor with forked subprocesses though: if PyTorch uses shared memory, then we are technically still copying the ID tensor into shared memory implicitly.
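For what it's worth, the active sharing strategy can be checked via torch.multiprocessing; as far as I know both Linux strategies are ultimately backed by shared-memory segments, which is consistent with that concern (a quick sketch):

```python
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # 'file_descriptor' is the Linux default
```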