dgl [RFC] Support for graph level features

🚀 Feature

The purpose of this issue is to resume the discussion on graph-level features from issues #417, #714, #737, #1316, and #1449. Supporting graph-level features in dgl.DGLGraph would be beneficial for the reasons given below. I will try to follow up with a PR soon if this RFC is reviewed positively.

Motivation

Many GNN models use graph-level (global) features not only as constant tensors or numbers passed to message passing functions, but also as learnable parameters. Relevant examples were given in the issues mentioned above. An additional example is the MEGNet model, described in https://arxiv.org/pdf/1812.05055.pdf. Supporting graph-level features in dgl.DGLGraph would be convenient for users who are currently forced to write their own workarounds.

Pitch

Graph-level features should be easy to set, accessible from all message passing functions, and correctly concatenated by dgl.batch(). Usage of graph-level features in DGL could look like this:

g.gdata["feat"] = graph_level_features

def edge_udf(self, edges):
   graph_level_features = edges._graph.gdata["feat"]

def node_udf(self, nodes):
   graph_level_features = nodes._graph.gdata["feat"]

Alternatives

Currently, as established in the issues mentioned above, DGL doesn't directly support graph-level features, neither on a single DGLGraph nor on batched graphs. Two alternatives have been presented.

The user can add a gdata attribute to the DGLGraph object using Python's setattr() function:

setattr(g, "gdata", {})
g.gdata["feat"] = graph_level_features

This option is insufficient because gdata will be inaccessible in edge_udf() or node_udf(). Those functions operate on dgl.udf.EdgeBatch and dgl.udf.NodeBatch objects respectively, and although the underlying graph can be accessed (through dgl.udf.EdgeBatch._graph or dgl.udf.NodeBatch._graph), the gdata set by the user is not included in this _graph attribute.

The user can add a gdata dictionary to their GNN model as an attribute:

class Model:
   def __init__(self):
      self.gdata = {}
      self.gdata["feat"] = graph_level_features

This option allows access to gdata in node and edge UDFs within this specific class, but the user still has to batch graph-level features manually, even though this could be done automatically, as it is for node and edge features. It is also more problematic for complex models built from multiple classes and modules, any of which may modify the graph-level features. In that case the user needs to add an extra parameter to every relevant constructor and function in order to track and update the graph-level features properly. Instead, they could be passed along with the DGLGraph as an attribute.
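
To illustrate the manual bookkeeping this workaround requires, here is a minimal sketch of batching and unbatching graph-level features by hand. It uses plain Python lists instead of tensors and does not call DGL; `batch_gdata` and `unbatch_gdata` are hypothetical helper names invented for this illustration, and the one-row-per-graph layout is an assumption about what an automatic dgl.batch() integration might produce.

```python
from typing import Dict, List

GData = Dict[str, List[float]]


def batch_gdata(gdata_list: List[GData]) -> Dict[str, List[List[float]]]:
    """Stack per-graph feature dicts into one dict with one row per graph.

    Hypothetical helper, not part of DGL.
    """
    if not gdata_list:
        return {}
    keys = list(gdata_list[0])
    batched: Dict[str, List[List[float]]] = {key: [] for key in keys}
    for gdata in gdata_list:
        if set(gdata) != set(keys):
            raise ValueError("all graphs must define the same graph-level keys")
        for key in keys:
            batched[key].append(list(gdata[key]))
    return batched


def unbatch_gdata(batched: Dict[str, List[List[float]]]) -> List[GData]:
    """Inverse of batch_gdata: split the rows back into per-graph dicts."""
    if not batched:
        return []
    num_graphs = len(next(iter(batched.values())))
    return [{key: rows[i] for key, rows in batched.items()} for i in range(num_graphs)]
```

With this approach every training loop has to call such helpers next to dgl.batch() and keep the result in sync with the batched graph, which is the overhead the proposed g.gdata would remove.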

Jan 12 '24 16:01 agrabows

Hi, just as you've pointed out, this is indeed an often-discussed topic. My view of this is to break it into two aspects:

How to support graph-level features in the dataloading pipeline. This is related to the future plan of GraphBolt @frozenbugs
How to utilize graph-level features in message passing.

As your request is more about the second point, my suggestion is to use DGL's readout/broadcast operations to convert between graph-level features and node-/edge-level features so that you can use them directly in your UDFs. See the following APIs:

https://docs.dgl.ai/api/python/dgl.html#batching-and-reading-out-ops
https://docs.dgl.ai/generated/dgl.broadcast_nodes.html#dgl.broadcast_nodes
https://docs.dgl.ai/generated/dgl.broadcast_edges.html#dgl.broadcast_edges
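
To make the semantics concrete, the following plain-Python sketch shows what broadcasting graph-level features to nodes means for a batched graph. It does not call DGL: `broadcast_to_nodes` is a hypothetical stand-in written for illustration, and it assumes that nodes of the same graph are stored contiguously in the batch, with the per-graph node counts given explicitly.

```python
from typing import List, Sequence


def broadcast_to_nodes(
    graph_feats: Sequence[Sequence[float]],
    batch_num_nodes: Sequence[int],
) -> List[List[float]]:
    """Repeat each graph's feature row once per node of that graph.

    Hypothetical illustration of the broadcast semantics, not DGL code.
    """
    if len(graph_feats) != len(batch_num_nodes):
        raise ValueError("need exactly one feature row per graph")
    node_feats: List[List[float]] = []
    for feat, num_nodes in zip(graph_feats, batch_num_nodes):
        # Graphs in a batch may have different sizes, so the repeat count varies.
        node_feats.extend(list(feat) for _ in range(num_nodes))
    return node_feats


# Two graphs with 2 and 3 nodes: the result has 5 rows, one per node.
per_node = broadcast_to_nodes([[1.0], [2.0]], [2, 3])
```

The resulting per-node rows can then be stored as an ordinary node feature and read inside a node UDF, which is how the linked broadcast ops are meant to be combined with message passing.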

Jan 18 '24 02:01 jermainewang

@jermainewang Broadcasting graph-level features into node features solves the problems I mentioned, but on the other hand it makes the entire model unnecessarily slower, as many more calculations have to be done. By supporting graph-level features in the dataloading pipeline, do you mean that they will be accessible from the DGLGraph? If I understand correctly, that would mean they are also accessible during message passing.

Jan 23 '24 13:01 agrabows

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

Feb 23 '24 01:02 github-actions[bot]

but on the other hand it makes entire model unnecessarily slower as much more calculations have to be done

I'm curious how this could be avoided if graph-level features were introduced into the message passing abstraction. My understanding is that it may require dedicated GPU kernels. Remember that each graph in a batch may have a different number of nodes and edges, so it is not like normal dense broadcasting.

Feb 29 '24 01:02 jermainewang

@jermainewang Since GraphBolt has come out, are there any updates on this issue? I believe having something like graph.gdata that works with dgl.batch and dgl.unbatch would be useful, especially for heterographs.

Apr 22 '24 11:04 e-yi

The main blocker for this feature request is now at the operator/kernel level. We haven't found a clear solution other than broadcasting/gathering graph-level information to/from node-/edge-level information, which would be equivalent to the broadcast and readout ops listed above.
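
As a rough illustration of the gathering direction, the sketch below reduces per-node rows back to one row per graph by summing over each graph's segment of nodes. It is plain Python and does not call DGL; `readout_sum` is a hypothetical name, and the contiguous per-graph node layout is an assumption made for this example.

```python
from typing import List, Sequence


def readout_sum(
    node_feats: Sequence[Sequence[float]],
    batch_num_nodes: Sequence[int],
) -> List[List[float]]:
    """Sum node feature rows per graph (a segment sum over the batch).

    Hypothetical illustration of readout semantics, not DGL code.
    """
    if sum(batch_num_nodes) != len(node_feats):
        raise ValueError("batch_num_nodes must add up to the number of node rows")
    feat_dim = len(node_feats[0]) if node_feats else 0
    graph_feats: List[List[float]] = []
    offset = 0
    for num_nodes in batch_num_nodes:
        total = [0.0] * feat_dim
        # Segment boundaries differ per graph because graph sizes differ.
        for row in node_feats[offset:offset + num_nodes]:
            for j, value in enumerate(row):
                total[j] += value
        graph_feats.append(total)
        offset += num_nodes
    return graph_feats


# Five node rows belonging to two graphs with 2 and 3 nodes.
per_graph = readout_sum([[1.0], [2.0], [3.0], [4.0], [5.0]], [2, 3])
```

The variable segment lengths are the reason this is not an ordinary dense reduction.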

Another direction to drive the request is to focus on the API level. For homogeneous graphs, using broadcast/readout ops to handle graph-level data doesn't seem to be a big hassle. @e-yi If you could share some code examples showing how difficult it is to use them with heterogeneous graphs, that would be helpful.

Apr 25 '24 01:04 jermainewang

For homogeneous graphs, using broadcast/readout ops to handle graph-level data doesn't seem to be a big hassle.

Yes, and, please correct me if I'm wrong, the dgl.function APIs are mostly just wrappers around the dgl.ops module, so in most cases it's also quite easy to use DGL without assigning graph.ndata or graph.edata (and I think my code becomes more elegant this way). My original motivation is very simple: I wish to have a way to organize all graph-related features in dgl.DGLGraph objects, and it should at least support dgl.batch operations.

May 04 '24 09:05 e-yi