[GraphBolt][MultiGPU] Error occurs when running multiGPU example with `num-workers` > 0

Open Skeleton003 opened this issue 1 year ago • 0 comments

🔨Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

Running `python examples/multigpu/graphbolt/node_classification.py --num-workers=2` (any value greater than 0 triggers the problem) raises the following error in every distributed replica:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 359, in __setstate__
    self._datapipe = dill.loads(value)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
AttributeError: 'PyCapsule' object has no attribute 'cudaHostUnregister'
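
The traceback shows the failure happening while the spawned worker process unpickles the datapipe (`__setstate__` calling `dill.loads`). As a rough way to narrow down which object in a pipeline fails to serialize, a round-trip check like the one below may help. This is only a sketch under several assumptions: `roundtrip_error` and `HoldsNativeHandle` are hypothetical names, not part of DGL or PyTorch; it uses the standard-library `pickle` rather than `dill`; and it does not reproduce the `cudaHostUnregister` error itself. An in-process round trip can also succeed where a freshly spawned process fails, so a pass here is not conclusive.

```python
import pickle
import threading


def roundtrip_error(obj):
    """Return None if obj survives a pickle round trip, else the exception raised."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as exc:  # report any serialization failure to the caller
        return exc
    return None


class HoldsNativeHandle:
    """Hypothetical stand-in for an object that wraps a non-picklable native resource."""

    def __init__(self):
        # A thread lock cannot be pickled, similar in spirit to a native handle.
        self.lock = threading.Lock()


if __name__ == "__main__":
    for name, candidate in [("plain dict", {"a": 1}), ("native handle", HoldsNativeHandle())]:
        err = roundtrip_error(candidate)
        status = "ok" if err is None else f"failed: {type(err).__name__}: {err}"
        print(f"{name}: {status}")
```

Applying such a check to each component handed to the data loader could indicate which one carries state that does not survive the transfer to worker processes when `num-workers` is greater than 0.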

Depending work items or issues

Skeleton003, May 08 '24 07:05