DGL Multi-GPU Example CUDA Runtime Error
🐛 Bug
Run examples/multigpu/node_classification_sage.py --mode benchmark --gpu=0,1,2,3.
The error I got:
Training in benchmark mode using 4 GPU(s)
Loading data
Training...
Epoch 00000 | Loss 2.2777 | Accuracy 0.7878 | Time 8.7002
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
... (the same assertion is repeated for threads [2,0,0] through [31,0,0])
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa9d9f9e4d7 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa9d9f6836b in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa9da03ab58 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c36b (0x7fa9da00b36b in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7fa9da01a930 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x2b (0x7fa99c9a9419 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_2.0.0.so)
frame #6: CUDARawDelete + 0x1c (0x7fa99c9a84a6 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_2.0.0.so)
frame #7: dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*) + 0x25c (0x7fa9aea72afc in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/libdgl.so)
frame #8: dgl::UnitGraph::COO::~COO() + 0xff (0x7fa9aebdb4ef in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/libdgl.so)
frame #9: dgl::UnitGraph::~UnitGraph() + 0x130 (0x7fa9aebdafc0 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/libdgl.so)
frame #10: dgl::HeteroGraph::~HeteroGraph() + 0xb5 (0x7fa9aea90fa5 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/libdgl.so)
frame #11: DGLObjectFree + 0xc5 (0x7fa9aea50ff5 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/libdgl.so)
frame #12: <unknown function> + 0x1171e (0x7fa99c98a71e in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x13b421 (0x560a4c17d421 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #14: <unknown function> + 0x126ccb (0x560a4c168ccb in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #15: <unknown function> + 0x114b96 (0x560a4c156b96 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #16: <unknown function> + 0x13b1cc (0x560a4c17d1cc in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #17: <unknown function> + 0x745ea5 (0x7faa40c2bea5 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #18: torch::autograd::deleteNode(torch::autograd::Node*) + 0x54 (0x7faa2ba53d64 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: std::_Sp_counted_deleter<torch::autograd::PyNode*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x7faa40c26a0e in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #20: torch::autograd::deleteNode(torch::autograd::Node*) + 0xa9 (0x7faa2ba53db9 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: std::_Sp_counted_deleter<torch::autograd::generated::AddBackward0*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x7faa2b1a9d7e in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x4ac2df0 (0x7faa2ba34df0 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: c10::TensorImpl::~TensorImpl() + 0x1b5 (0x7fa9d9f7c695 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #24: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa9d9f7c7b9 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #25: <unknown function> + 0x75acd8 (0x7faa40c40cd8 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #26: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7faa40c41085 in /opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #27: <unknown function> + 0x121dc8 (0x560a4c163dc8 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #28: <unknown function> + 0x133068 (0x560a4c175068 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #29: <unknown function> + 0x133051 (0x560a4c175051 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #30: <unknown function> + 0x133051 (0x560a4c175051 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #31: <unknown function> + 0x132cc3 (0x560a4c174cc3 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #32: <unknown function> + 0x10f928 (0x560a4c151928 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #33: <unknown function> + 0x148e23 (0x560a4c18ae23 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x2584 (0x560a4c15d874 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x2f1 (0x560a4c15a261 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #36: _PyFunction_Vectorcall + 0x19c (0x560a4c16b89c in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x6d5 (0x560a4c15b9c5 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #38: _PyFunction_Vectorcall + 0x106 (0x560a4c16b806 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x3aa (0x560a4c15b69a in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #40: _PyEval_EvalCodeWithName + 0x2f1 (0x560a4c15a261 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #41: _PyFunction_Vectorcall + 0x19c (0x560a4c16b89c in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x11bb (0x560a4c15c4ab in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #43: _PyEval_EvalCodeWithName + 0x2f1 (0x560a4c15a261 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #44: PyEval_EvalCodeEx + 0x39 (0x560a4c20cfd9 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #45: PyEval_EvalCode + 0x1b (0x560a4c20cf9b in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #46: <unknown function> + 0x1eb929 (0x560a4c22d929 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #47: <unknown function> + 0x1ea923 (0x560a4c22c923 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #48: PyRun_StringFlags + 0x7d (0x560a4c22a30d in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #49: PyRun_SimpleStringFlags + 0x3d (0x560a4c0cff56 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #50: Py_RunMain + 0x27e (0x560a4c2297fe in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #51: Py_BytesMain + 0x39 (0x560a4c2007b9 in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
frame #52: __libc_start_main + 0xf3 (0x7faa71c50083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #53: <unknown function> + 0x1be6bd (0x560a4c2006bd in /opt/conda/envs/dgl-dev-gpu-118/bin/python)
LIBXSMM_VERSION: main-1.17-3659 (25693771)
LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz]
Registry and code: 13 MB
Command: /opt/conda/envs/dgl-dev-gpu-118/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=37, pipe_handle=67) --multiprocessing-fork
Uptime: 16.470712 s
Traceback (most recent call last):
File "error.py", line 380, in <module>
mp.spawn(
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ubuntu/Work/benchmark/error.py", line 288, in run
train(
File "/home/ubuntu/Work/benchmark/error.py", line 214, in train
y_hat = model(blocks, x)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/Work/benchmark/error.py", line 76, in forward
h = layer(block, h)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/nn/pytorch/conv/sageconv.py", line 237, in forward
graph.update_all(msg_fn, fn.mean("m", "neigh"))
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/heterograph.py", line 5110, in update_all
ndata = core.message_passing(
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/core.py", line 398, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/core.py", line 368, in invoke_gspmm
z = op(graph, x)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 215, in func
return gspmm(g, "copy_lhs", reduce_op, x, None)
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 112, in gspmm
deg = F.astype(F.clamp(deg, 1, max(g.num_edges(), 1)), F.dtype(ret))
File "/opt/conda/envs/dgl-dev-gpu-118/lib/python3.8/site-packages/dgl-1.2-py3.8-linux-x86_64.egg/dgl/backend/pytorch/tensor.py", line 126, in astype
return input.type(ty)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- The traceback of the Python part is not accurate because CUDA kernel errors may be reported asynchronously at some other API call, but the error appears to occur when computing the loss inside cross_entropy().
- I tried to check y and y_hat to see whether they are valid, but as soon as I print them, assert on them, or otherwise inspect them, the error disappears.
- If I set CUDA_LAUNCH_BLOCKING=1, there is no error and the entire example works fine.
- If I comment out lines 166-167 (prefetch_node_feats and prefetch_labels in NeighborSampler), there is no error and the entire example works fine.
- Upgrading the torch version doesn't help.
- The same problem occurs with the master branch as of September 2023.
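For context, the device-side assertion above (`t >= 0 && t < n_classes`) fires when a label handed to the loss lies outside [0, n_classes). A minimal CPU sketch (the tensors below are made up, not taken from the example) shows the same bounds check surfacing as a catchable Python error, which can help confirm whether out-of-range labels are the culprit:

```python
import torch
import torch.nn.functional as F

n_classes = 3
logits = torch.randn(4, n_classes)

# Valid labels: every entry is in [0, n_classes).
good = torch.tensor([0, 1, 2, 1])
loss = F.cross_entropy(logits, good)  # computes normally

# An out-of-range label trips the same bounds check the CUDA kernel
# asserts on (t >= 0 && t < n_classes); on CPU it surfaces as a
# regular IndexError instead of a device-side assert.
bad = torch.tensor([0, 1, n_classes, 1])
try:
    F.cross_entropy(logits, bad)
except IndexError as e:
    print("caught:", e)
```

Running the GPU path with `CUDA_LAUNCH_BLOCKING=1` (or moving the offending tensors to CPU) is the usual way to turn the opaque device-side assert into an error like this one.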
To Reproduce
Run examples/multigpu/node_classification_sage.py --mode benchmark --gpu=0,1,2,3.
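To localize the failing kernel, the example can be re-run with synchronous error reporting (a command sketch, assuming it is run from the DGL repository root; note the reporter observed that the error disappears under this flag):

```shell
# Force synchronous kernel launches so the Python traceback points at
# the op that actually failed, rather than at a later API call.
CUDA_LAUNCH_BLOCKING=1 python examples/multigpu/node_classification_sage.py \
    --mode benchmark --gpu=0,1,2,3
```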
Expected behavior
The example should run without any error.
Environment
- DGL Version (e.g., 1.0): current master branch (nightly build)
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.0.0+cu118
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): conda
- Build command you used (if compiling from source): bash ./script/build_dgl.sh -g -e '-DBUILD_GRAPHBOLT=ON'
- Python version: 3.8.17
- CUDA/cuDNN version (if applicable): 11.8
- GPU models and configuration (e.g. V100): EC2 g4dn.metal
- Any other relevant information: torch version 2.0.0+cu118
Additional context
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
It runs OK in the DGL NGC container 24.01. Will try to reproduce without the container.
root@ecfeb6f27748:/opt/dgl/dgl-source/examples/multigpu# python node_classification_sage.py --gpu 0,1,2,3
Training in mixed mode using 4 GPU(s)
Loading data
This will download 1.38GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/products.zip
Downloaded 1.38 GB: 100%|| 1414/1414 [00:20<00:00, 67.44it/s]
Extracting dataset/products.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|1/1 [00:01<00:00, 1.66s/it]
Converting graphs into DGL objects...
100%|1/1 [00:00<00:00, 4.29it/s]
Saving...
Training...
Epoch 00000 | Loss 2.4287 | Accuracy 0.7714 | Time 5.3175
Epoch 00001 | Loss 0.9772 | Accuracy 0.8369 | Time 4.5307
Epoch 00002 | Loss 0.7459 | Accuracy 0.8538 | Time 4.5606
Epoch 00003 | Loss 0.6612 | Accuracy 0.8641 | Time 4.5573
Epoch 00004 | Loss 0.5897 | Accuracy 0.8713 | Time 4.5261
Epoch 00005 | Loss 0.5837 | Accuracy 0.8748 | Time 4.5167
Epoch 00006 | Loss 0.5278 | Accuracy 0.8786 | Time 4.5288
Epoch 00007 | Loss 0.4970 | Accuracy 0.8826 | Time 4.5237
Epoch 00008 | Loss 0.5000 | Accuracy 0.8828 | Time 4.5237
Epoch 00009 | Loss 0.4745 | Accuracy 0.8863 | Time 4.5328
Testing...
Test accuracy 0.7296