torchdrug

[Problem] COO Sparse Tensor Multiplication Is Very Slow.

Open mrzzmrzz opened this issue 2 years ago • 2 comments

I found that sparse tensor multiplication is very slow in the GearNet module.

Here is the relevant code in message_and_aggregate:

adjacency = utils.sparse_coo_tensor(
  torch.stack([node_in, node_out]), graph.edge_weight,
  (graph.num_node, graph.num_node * graph.num_relation)
)

When I replaced the original COO sparse tensor with a CSR sparse tensor, the time spent running GearNet to predict protein labels dropped by about 50%, from 16 minutes per epoch to 8 minutes per epoch on an RTX 3090 (batch size: 8, GPUs: 1).
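For readers who want to try the same swap, here is a minimal sketch of the idea, not the actual patch. The toy sizes, the edge list, and the column layout (node_out * num_relation + relation, inferred only from the adjacency shape in the snippet above) are assumptions for illustration. It builds the adjacency as COO with plain torch.sparse_coo_tensor, converts it once with to_sparse_csr(), and checks that both layouts give the same product.

```python
import torch

# Hypothetical toy sizes; these are not taken from GearNet.
num_node, num_relation, dim = 4, 2, 3

# Hypothetical edge list. The column layout below is an assumption
# inferred from the (num_node, num_node * num_relation) shape.
node_in = torch.tensor([0, 1, 2, 3, 0])
node_out = torch.tensor([1, 2, 3, 0, 2])
relation = torch.tensor([0, 1, 0, 1, 1])
col = node_out * num_relation + relation
edge_weight = torch.ones(node_in.numel())

shape = (num_node, num_node * num_relation)

# COO adjacency, coalesced so duplicate entries (if any) are summed.
coo = torch.sparse_coo_tensor(
    torch.stack([node_in, col]), edge_weight, shape
).coalesce()

# One-time conversion to CSR; the stored values are the same.
csr = coo.to_sparse_csr()

# Multiply both layouts with the same dense feature matrix.
x = torch.randn(num_node * num_relation, dim)
out_coo = torch.mm(coo, x)
out_csr = torch.mm(csr, x)

# The two results should agree up to floating-point error.
same = torch.allclose(out_coo, out_csr, atol=1e-6)
```

Whether this yields a speedup depends on the PyTorch version, the device, and the sparsity pattern, so it is worth timing both paths on your own setup before drawing conclusions.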

I'm not sure whether this problem is caused by my own GPU device or the type of sparse tensor. If it's the latter, maybe I can open a pull request for it.

mrzzmrzz, Feb 27 '23 03:02

That's a good catch! Do you know based on which PyTorch version you observe this speedup?

CSR is more efficient for matrix multiplication, while COO is more efficient for editing sparse matrices. We were not confident about how well PyTorch covers CSR operations, so we fell back to COO everywhere. If CSR is well supported by PyTorch now, we will update TorchDrug accordingly. This could bring a large speedup to many GNN models.
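To make the difference between the two layouts concrete, here is a small illustrative sketch; the 3x3 matrix is made up for the example. COO stores one (row, column) pair per nonzero, while CSR replaces the row indices with a compressed row-pointer array, so the nonzeros of each row sit in one contiguous slice. That contiguity is the usual reason row-wise products tend to be cheaper in CSR.

```python
import torch

# Hypothetical 3x3 matrix with four nonzero entries.
dense = torch.tensor([
    [0.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [0.0, 0.0, 5.0],
])

# COO: explicit (row, col) index pair for every nonzero.
coo = dense.to_sparse()
coo_rows = coo.indices()[0].tolist()   # one row index per nonzero
coo_cols = coo.indices()[1].tolist()   # one column index per nonzero

# CSR: row indices are compressed into pointers of length n_rows + 1.
csr = dense.to_sparse_csr()
crow = csr.crow_indices().tolist()     # row i spans values[crow[i]:crow[i+1]]
ccol = csr.col_indices().tolist()
vals = csr.values().tolist()

# Nonzeros of row 1 are a contiguous slice of the CSR value array.
row1_vals = vals[crow[1]:crow[2]]
```

Here row 1 of the matrix corresponds to the slice vals[crow[1]:crow[2]], which can be read without scanning the other rows' entries.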

KiddoZhu, Mar 28 '23 22:03

My PyTorch version is 1.12.1 with CUDA 11.6. As far as I know, PyTorch supports the CSR format in recent versions.

mrzzmrzz, Mar 30 '23 12:03