
【GraphBolt】【HeteroGraph】HeteroGraph can not generate batch

Open Ying-1106 opened this issue 1 year ago • 11 comments

🐛 Bug

When I used GraphBolt for a link prediction task on a heterogeneous graph, errors frequently occurred during batch generation. I created a dataset called HGBl-amazon with one node type, product, and two edge types, Product-0-Product and Product-1-Product. I set up a link prediction task and stored the edge information in train_set, val_set and test_set, following the GraphBolt examples. However, I always hit an error while iterating through the dataloader.

code:

self.model.train()
loss_all = 0.0
for i, data in enumerate(self.train_dataloader):  # this line always raises an error

Ying-1106 avatar Jun 12 '24 13:06 Ying-1106

This is the code that generates the dataset:

import os

import numpy as np
import dgl
import dgl.graphbolt as gb

base_dir = os.path.join(now_dir, 'HGBl_base_dir')

# Construct the OnDiskDataset from an existing DGLGraph
graph_file_path = '/data/zzh/TEST_DIR/HGBl_dir/HGBl-amazon_DGLGraph.bin'
HGBl_Graph = dgl.load_graphs(filename=graph_file_path)[0][0]

# Save the product node features
feature = HGBl_Graph.ndata['h']
product_feat_np = feature.numpy()
product_feat_file = os.path.join(base_dir, 'product_feat_file.npy')
np.save(file=product_feat_file, arr=product_feat_np)

# Save the edges of type product-product-0 as a (2, num_edges) array
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-0', 'product'))
src = src.numpy()
dst = dst.numpy()
P0P_npy = np.stack((src, dst))
P0P_npy_file = os.path.join(base_dir, 'P0P.npy')
np.save(file=P0P_npy_file, arr=P0P_npy)

# Save the edges of type product-product-1 as a (2, num_edges) array
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-1', 'product'))
src = src.numpy()
dst = dst.numpy()
P1P_npy = np.stack((src, dst))
P1P_npy_file = os.path.join(base_dir, 'P1P.npy')
np.save(file=P1P_npy_file, arr=P1P_npy)

# The edge information numpy files for train_set, val_set, and test_set are
# already stored locally. Each set includes the source and target node IDs of
# the two edge types, P-0-P and P-1-P.

# Train set
train_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P0P.npy"
train_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P1P.npy"

# Val set
val_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P0P.npy"
val_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P1P.npy"

# Test set
test_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P0P.npy"
test_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P1P.npy"

yaml_content = f"""
  dataset_name: HGBl_amazon_GB
  graph:
    nodes:
      - type: product
        num: 10099

    edges:
      - type: "product:product-product-0:product"
        format: numpy
        path: {os.path.basename(P0P_npy_file)}

      - type: "product:product-product-1:product"
        format: numpy
        path: {os.path.basename(P1P_npy_file)}
     
  feature_data:

    - domain: node
      type: product
      name: feat
      format: numpy
      in_memory: false
      path: {os.path.basename(product_feat_file)}

  tasks:
    - name: link_prediction
      num_classes: 100
      train_set:
        - type: "product:product-product-0:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(train_set_POP_path)}

        - type: "product:product-product-1:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(train_set_P1P_path)}
      
      validation_set:
        - type: "product:product-product-0:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(val_set_POP_path)}

        - type: "product:product-product-1:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(val_set_P1P_path)}

      test_set:
        - type: "product:product-product-0:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(test_set_POP_path)}

        - type: "product:product-product-1:product"
          data:
            - name: seeds
              format: numpy
              path: {os.path.basename(test_set_P1P_path)}

"""

metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
    f.write(yaml_content)

dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph.to(device)
feature = dataset.feature.to(device)
tasks = dataset.tasks
link_pred_task = tasks[0]

datapipe = gb.ItemSampler(link_pred_task.train_set, batch_size=16, shuffle=True)
datapipe = datapipe.copy_to(device)
datapipe = datapipe.sample_uniform_negative(graph, 1)
datapipe = datapipe.sample_neighbor(graph, [-1, -1, -1])
datapipe = datapipe.fetch_feature(
    feature, node_feature_keys={"product": ["feat"]}
)

dataloader = gb.DataLoader(datapipe, num_workers=0)

Ying-1106 avatar Jun 12 '24 13:06 Ying-1106

Hello @Ying-1106, it'd be helpful if you can provide the error message. And you can try print(train_set) to examine the training set and check if data is correct.

Skeleton003 avatar Jun 12 '24 18:06 Skeleton003

And please share which DGL version you're using.

Rhett-Ying avatar Jun 13 '24 00:06 Rhett-Ying

And please share which DGL version you're using.

My DGL version is 2.2.1 + cu118

Ying-1106 avatar Jun 13 '24 02:06 Ying-1106

Hello @Ying-1106, it'd be helpful if you can provide the error message. And you can try print(train_set) to examine the training set and check if data is correct.

When I print train_set:

print(link_pred_task.train_set)
ItemSetDict(
    itemsets={'product:product-product-0:product': ItemSet(
                  items=(tensor([[ 552, 7161],
                                 [8166, 9154],
                                 [2310, 2945],
                                 ...,
                                 [1367, 4038],
                                 [ 728, 7947],
                                 [5994, 5039]], dtype=torch.int32),),
                  names=('seeds',),
              ),
              'product:product-product-1:product': ItemSet(
                  items=(tensor([[ 454, 8906],
                                 [7462, 9232],
                                 [8126,  359],
                                 ...,
                                 [4892,  731],
                                 [6761, 3064],
                                 [8407, 9684]], dtype=torch.int32),),
                  names=('seeds',),
              )},
    names=('seeds',),
)
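The printed seeds are pairs of node IDs, and the YAML above declares 10099 product nodes, so one cheap check before involving the sampler is that every ID lies in the range [0, 10099). The following is a minimal sketch of such a check, assuming each seed row is a (source, destination) pair. `check_seed_pairs` is a hypothetical helper written for illustration and is not part of DGL or GraphBolt.

```python
def check_seed_pairs(pairs, num_nodes):
    """Return (row, src, dst) for every pair that has an out-of-range ID.

    `pairs` may be any iterable of 2-element rows, for example a list of
    tuples or a NumPy array of shape (N, 2) obtained from np.load(...).
    """
    bad = []
    for row, pair in enumerate(pairs):
        if len(pair) != 2:
            raise ValueError(f"row {row}: expected 2 IDs, got {len(pair)}")
        src, dst = int(pair[0]), int(pair[1])
        if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
            bad.append((row, src, dst))
    return bad


# Toy rows for illustration; 10099 is the product-node count from the YAML.
sample = [(552, 7161), (8166, 9154), (10099, 3), (-1, 5)]
print(check_seed_pairs(sample, num_nodes=10099))
# prints [(2, 10099, 3), (3, -1, 5)]
```

An empty result rules out out-of-range seeds only; it says nothing about other possible causes such as tensors living on different devices.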

The error:

Whenever I step through the line `for step, data in enumerate(dataloader):`, the program terminates abruptly and the terminal prints one of three messages: "free(): invalid size", "munmap_chunk(): invalid pointer", or "double free or corruption (out)".

Ying-1106 avatar Jun 13 '24 03:06 Ying-1106

Hello @Ying-1106, it'd be helpful if you can provide the error message. And you can try print(train_set) to examine the training set and check if data is correct.

it's the error message:

RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of Bufferer(datapipe=FeatureFetcher)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 125, in __iter__
    yield self._apply_fn(data)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 90, in _apply_fn
    return self.fn(data)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
    minibatch = self.transformer(minibatch)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 65, in _preprocess
    ) = SubgraphSampler._seeds_preprocess(minibatch)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 166, in _seeds_preprocess
    unique_seeds, compacted = unique_and_compact(nodes)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 56, in unique_and_compact
    unique[ntype], compacted[ntype] = unique_and_compact_per_type(
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 47, in unique_and_compact_per_type
    unique, compacted, _ = torch.ops.graphbolt.unique_and_compact(
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of MiniBatchTransformer(datapipe=UniformNegativeSampler, transformer=_preprocess)

During handling of the above exception, another exception occurred:

  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
    full_msg = f"{msg} {datapipe.__class__.__name__}({_generate_input_args_string(datapipe)})"
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
    result.append((name, _simplify_obj_name(value)))
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
    return repr(obj)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in __repr__
    csc_indptr_str = str(self.csc_indptr)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
    return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
    full_msg = f"{msg} {datapipe.__class__.__name__}({_generate_input_args_string(datapipe)})"
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
    result.append((name, _simplify_obj_name(value)))
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
    return repr(obj)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in __repr__
    csc_indptr_str = str(self.csc_indptr)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
    return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/dataloader.py", line 68, in __iter__
    yield from self.dataloader
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in __next__
    return self._get_next()
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next
    result = next(self.iterator)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next
    result = next_func(*args, **kwargs)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in __next__
    return next(self._datapipe_iter)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
    full_msg = f"{msg} {datapipe.__class__.__name__}({_generate_input_args_string(datapipe)})"
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
    result.append((name, _simplify_obj_name(value)))
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
    return repr(obj)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in __repr__
    csc_indptr_str = str(self.csc_indptr)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
    return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
    return torch.cat(
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 385, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 385, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/torch_based_feature_store.py", line 225, in __repr__
    str(self._tensor), " " * len(" feature=")
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/torch_based_feature_store.py", line 432, in __repr__
    features_str = textwrap.indent(str(self._features), " ").strip()
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
    return repr(obj)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
    result.append((name, _simplify_obj_name(value)))
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
    full_msg = f"{msg} {datapipe.__class__.__name__}({_generate_input_args_string(datapipe)})"
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 306, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 325, in __iter__
    for data in self.datapipe:
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 280, in __iter__
    yield from self.datapipe
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in __next__
    return next(self._datapipe_iter)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next
    result = next_func(*args, **kwargs)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next
    result = next(self.iterator)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in __next__
    return self._get_next()
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/data/zzh/TEST_DIR/GraphBolt_异质图(链接预测有BUG).py", line 696, in get_HGBl_amazon_GB
    for step, data in enumerate(dataloader):
  File "/data/zzh/TEST_DIR/GraphBolt_异质图(链接预测有BUG).py", line 750, in <module>
    get_HGBl_amazon_GB()
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/zzh/anaconda3/envs/YING/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by __iter__ of Bufferer(datapipe=FeatureFetcher)
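The error text itself recommends CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so that the reported stack is more likely to point at the call that actually failed. Below is a minimal sketch of setting it from inside the script. The variable has to be set before CUDA is initialized, so the safest place is the very top of the file, ahead of `import torch`. `enable_sync_cuda_errors` is a hypothetical helper name used only for illustration.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Request synchronous CUDA kernel launches for easier debugging.

    Call this before the CUDA context is created, which in practice means
    before importing torch or dgl.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


enable_sync_cuda_errors()
# Import torch, dgl.graphbolt, etc. only after this point.
```

Launching the script as `CUDA_LAUNCH_BLOCKING=1 python script.py` has the same effect without touching the code.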

Ying-1106 avatar Jun 13 '24 03:06 Ying-1106

How do you generate the train_set? Are the node IDs in each seed edge-type-wise?

Rhett-Ying avatar Jun 13 '24 03:06 Rhett-Ying

How do you generate the train_set? Are the node IDs in each seed edge-type-wise?

I generate train_set from two numpy files, one for edge type P0P and one for edge type P1P, as below:

tasks:
  - name: link_prediction
    num_classes: 2
    train_set:
      - type: "product:product-product-0:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(train_set_POP_path)}
      - type: "product:product-product-1:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(train_set_P1P_path)}

These are the numpy arrays in train_set:

train_set_POP = np.load(train_set_POP_path)
train_set_P1P = np.load(train_set_P1P_path)

print(train_set_POP):
array([[ 552, 7161],
       [8166, 9154],
       [2310, 2945],
       ...,
       [1367, 4038],
       [ 728, 7947],
       [5994, 5039]])

print(train_set_P1P):
array([[ 454, 8906],
       [7462, 9232],
       [8126,  359],
       ...,
       [4892,  731],
       [6761, 3064],
       [8407, 9684]])

Ying-1106 avatar Jun 13 '24 04:06 Ying-1106

To find the root cause, I recommend narrowing the case down with the following steps.

  1. Does it crash on the first iteration?
  2. Could you try with CPU sampling?
  3. Try with a small fanout and a single layer.
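For the first question, one option is to print the batch index before each fetch, so the last line in the terminal shows where the run died even when the process aborts instead of raising. The following is a minimal sketch; `traced` is a hypothetical helper, and the list in the usage lines stands in for the real dataloader. The second and third steps would be tried by rerunning the same loop with the `.to(device)` and `copy_to(device)` calls removed and a one-element fanout list passed to `sample_neighbor` (for example `[2]`, an arbitrary illustrative value); those changes are not shown here.

```python
def traced(iterable, label="batch"):
    """Yield (step, item) pairs, printing the index before each fetch.

    The message is flushed before the fetch, so it is still visible if the
    process aborts (e.g. "double free or corruption") rather than raising.
    """
    iterator = iter(iterable)
    step = 0
    while True:
        print(f"fetching {label} {step}", flush=True)
        try:
            item = next(iterator)
        except StopIteration:
            return
        yield step, item
        step += 1


# Stand-in for the real dataloader.
for step, data in traced(["a", "b"]):
    print(step, data)
```

If the last message printed is "fetching batch 0", the failure happens while producing the very first minibatch.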

Rhett-Ying avatar Jun 13 '24 08:06 Rhett-Ying

To find the root cause, I recommend narrowing the case down with the following steps.

  1. Does it crash on the first iteration?
  2. Could you try with CPU sampling?
  3. Try with a small fanout and a single layer.

Thank you for your patient response. I have now resolved the issue, and the code for link prediction and node classification on heterogeneous graphs is running correctly. The previous bug might have been due to inconsistent devices.
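The thread does not say which objects ended up on different devices. As a hedged illustration of how such a mismatch could be caught before the first batch is drawn, the sketch below compares the `.device` attribute of any named objects. `assert_same_device` is a hypothetical helper, not a DGL or PyTorch API, and the `SimpleNamespace` objects stand in for real tensors.

```python
from types import SimpleNamespace


def assert_same_device(**named):
    """Raise if the named objects do not all report the same `.device`.

    Devices are compared by their string form, so "cuda" and "cuda:0"
    count as different here even though they may refer to the same GPU.
    """
    devices = {name: str(obj.device) for name, obj in named.items()}
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"device mismatch: {devices}")
    return devices


# Stand-ins with a `.device` attribute; torch tensors expose the same attribute.
indptr = SimpleNamespace(device="cuda:0")
seeds = SimpleNamespace(device="cpu")
try:
    assert_same_device(indptr=indptr, seeds=seeds)
except RuntimeError as err:
    print(err)
```

With the real pipeline, candidates to pass in would be tensors such as `graph.csc_indptr` (the attribute that appears in the traceback) alongside the seed tensors of a minibatch.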

Ying-1106 avatar Jun 14 '24 13:06 Ying-1106

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Jul 15 '24 01:07 github-actions[bot]