
[OSPP] GraphAr Extension for Kuzu

Open Thespica opened this issue 9 months ago • 11 comments

Description

Apache GraphAr (incubating) is a standard file format framework for graph data, with underlying storage in CSV, Apache Parquet, or Apache ORC columnar files. Kuzu is an embedded graph database designed for fast querying and good scalability, optimized for complex analytical workloads on very large databases, and it provides a rich set of retrieval features. Kuzu natively supports importing data from various formats such as CSV, Parquet, NumPy, and JSON. Additionally, Kuzu offers an extension framework for better integration with external systems, with extensions already implemented for systems such as DuckDB, PostgreSQL, and Neo4j.

This project aims to develop an extension for GraphAr, enabling Kuzu to directly load data from the GraphAr format, thereby leveraging Kuzu's powerful features to facilitate efficient querying and processing of graph data stored in the GraphAr format.

Output Requirements

  1. Complete a Kuzu extension tool that can import data from the GraphAr format.
  2. Provide complete documentation that details the steps and functionalities for using the extension and includes examples to ensure users can get started smoothly.
  3. (Bonus point) Promote the tool to become an official Kuzu extension.

Technical Requirements

  1. Familiar with the C++ programming language.
  2. Understanding of graph database concepts / relevant background in graph databases.

More Details

Apply at OSPP: https://summer-ospp.ac.cn/org/prodetail/25e7a0282

Expected Completion Hours: 240 Hours

Thespica avatar May 05 '25 14:05 Thespica

I've been working on something similar to this. Blog post with details.

Since I was not aware of GraphAr, I invented my own format. Proposed syntax:

CREATE NODE TABLE Person(ID INT64, name STRING, PRIMARY KEY(ID))
WITH (storage = '/tmp/karate_random');

CREATE REL TABLE knows(FROM Person TO Person, weight DOUBLE)
WITH (storage = '/tmp/karate_random');

MATCH (p1:Person)-[r:knows]->(p2:Person) RETURN p1.ID, p2.ID, r.weight;

I'm not sure this can be achieved in an extension, so I modified the Kuzu parser to support this syntax.

One difference vs. the proposal is that I use the Kuzu table catalog to store metadata instead of YAML files. This is similar to DuckLake vs. Iceberg.

adsharma avatar Sep 21 '25 21:09 adsharma

@adsharma Thanks for sharing! I see the use cases are slightly different. Your format is simpler (from an integration point of view) and well suited for in-memory graph processing, while GAR aims to provide a more scalable approach for out-of-core processing.

Let's imagine we have a property graph with movies and people and the following edge groups:

  • people can like movies
  • people can message each other
  • people can follow each other
  • movie can be a part of series (prequel, sequel, prev series, next series, etc.)

GAR stores properties in separate groups and splits everything into chunks. For example, if we want to evaluate something like MATCH (:people {name: Alice}) -> [likes] -> (:movies) with GAR:

  • we only need to load the people, likes, and movies data into memory, nothing else (far fewer edges in memory)
  • based on the known ID of the people start vertex and the vertex chunk size, we can locate the chunk containing only that vertex and load just that chunk
  • based on the edge chunk size and the known source people ID, we can locate the chunks of likes edges that start from this ID (because in GAR edges are pre-sorted before being split into chunks) and load only those chunks

That allows running queries on really huge graphs stored in Parquet files without loading the whole graph into memory.
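The chunk arithmetic sketched above can be shown in a few lines of Python. This is an illustrative sketch of the idea, not GraphAr's actual API; the function names and the CSR-style offsets layout are assumptions for the example.

```python
# Sketch of GAR-style chunked lookup: given a vertex ID and fixed chunk
# sizes, compute which chunk files need to be read. Names and layout are
# illustrative, not the exact GraphAr spec.

def vertex_chunk(vertex_id: int, vertex_chunk_size: int) -> int:
    """Index of the vertex chunk containing vertex_id."""
    return vertex_id // vertex_chunk_size

def vertex_offset_in_chunk(vertex_id: int, vertex_chunk_size: int) -> int:
    """Row offset of vertex_id inside its chunk."""
    return vertex_id % vertex_chunk_size

def edge_chunks_for_source(adj_offsets, src_id: int, edge_chunk_size: int):
    """Edge chunks holding edges whose source is src_id.

    adj_offsets[v] is the position of the first edge with source v in the
    globally source-sorted edge list (CSR-style offsets), so the edges of
    v occupy positions [adj_offsets[v], adj_offsets[v + 1]).
    """
    begin, end = adj_offsets[src_id], adj_offsets[src_id + 1]
    if begin == end:
        return []
    return list(range(begin // edge_chunk_size, (end - 1) // edge_chunk_size + 1))

# Vertex 1234 with chunk size 512 lives in chunk 2, at row 210 of that chunk.
print(vertex_chunk(1234, 512))            # 2
print(vertex_offset_in_chunk(1234, 512))  # 210

# With edges sorted by source, vertex 2's edges span positions [100, 260),
# so with 128-row edge chunks we only need to read chunks 0, 1 and 2.
print(edge_chunks_for_source([0, 40, 100, 260, 300], 2, 128))  # [0, 1, 2]
```

Because both lookups are pure arithmetic over the metadata, no data file has to be opened until the exact chunks are known.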

With your format, as I understand it, we would need to:

  • load at least the full nodes Parquet file to filter out only the people properties
  • load at least all the edges from the source people vertex to filter only likes

But that should not be a problem for in-memory processing, if we are loading a lot up front for fast queries anyway.

SemyonSinchenko avatar Sep 22 '25 09:09 SemyonSinchenko

Thank you for explaining! My solution is motivated by trying to serve Wikidata (90 million nodes, 800+ million edges) from Kuzu. The on-disk storage requirements were unacceptable due to denormalization, and I'm looking to serve graphs 10x that size. So on-disk storage and selective loading are the main use case.

I also want to compete with LLMs in terms of graph compression and storage efficiency by offloading some of the knowledge stored there into external storage.

Parquet files as they stand now aren't sufficient, but a step in the right direction.

I don't want to specify whether edges should be sorted by type or by graph structure; it depends on the use case, and I want to support both well.

Kuzu folks have made a decision to support strongly typed nodes and edges. But you can always store weakly typed graphs by merging them all into a "node" table and a "rel" table.

If the parquet file is sorted, you can do predicate pushdowns. DuckDB and Spark do it.
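Predicate pushdown on a sorted file amounts to skipping row groups whose min/max statistics cannot match the predicate. Here is a minimal pure-Python sketch of that mechanism (real readers such as DuckDB, Spark, or pyarrow apply it against actual Parquet row-group metadata; the tuple representation here is an assumption for the example):

```python
# Sketch of min/max row-group pruning, the mechanism behind predicate
# pushdown on sorted Parquet files. Each row group is modeled as a
# (min, max, row_count) tuple; a real reader gets these from file metadata.

def prune_row_groups(groups, lo, hi):
    """Return indices of row groups that may contain values in [lo, hi]."""
    return [i for i, (gmin, gmax, _) in enumerate(groups)
            if gmax >= lo and gmin <= hi]

# A file sorted on the filter column: stats are tight, so a range lookup
# touches a single group.
sorted_groups = [(0, 99, 100), (100, 199, 100), (200, 299, 100)]
print(prune_row_groups(sorted_groups, 150, 160))  # [1]

# The same data shuffled: every group's [min, max] spans the whole domain,
# so nothing can be skipped and every group must be scanned.
shuffled_groups = [(3, 290, 100), (0, 299, 100), (7, 295, 100)]
print(prune_row_groups(shuffled_groups, 150, 160))  # [0, 1, 2]
```

This is why the sort order matters so much: the data is identical in both cases, but only the sorted layout lets the statistics do any work.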

DuckDB native storage is also supported as an additional single-file option. Why? It has a few more compression tricks, and a single file is more convenient. In the TPC-H SF10k example, the Parquet files were 4 TB, but the DuckDB file was 2.7 TB.

Kuzu has an extension to read from duckdb. But I'm not sure if it can handle TB sized files and do efficient predicate pushdowns.

adsharma avatar Sep 22 '25 19:09 adsharma

I will read through the GAR spec and align where it makes sense: for example, using vertices.parquet instead of nodes.parquet, removing the filename prefix and relying on the directory name for the prefix, etc.

Where we diverge:

  • I'm exclusively focusing on CSR (sort by source, then by edge attributes including type, then by destination).
  • Use the Kuzu catalog for metadata (avoiding JSON/YAML), similar to DuckLake.
  • Potential support for DuckDB single-file storage in the future.
  • uv + Python based tooling to take any arbitrary billion-scale graph, apply the configurable sort explained above, and produce a directory in the standard format.
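The CSR ordering in the first bullet can be sketched in a few lines of Python. This is an illustrative sketch, not the actual tooling; the (src, etype, dst) tuple layout is an assumption, and real tooling would operate on Parquet via pyarrow or similar.

```python
# Sketch of CSR-style edge ordering: sort by source, then edge type, then
# destination, and build per-source offsets so each vertex's edges form a
# contiguous slice. The (src, etype, dst) tuple layout is illustrative.

def to_csr(edges, num_vertices):
    edges = sorted(edges)  # (src, etype, dst) tuples sort in exactly that order
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]  # prefix sum -> slice start positions
    return edges, offsets

edges = [(2, "knows", 0), (0, "likes", 2), (2, "follows", 1), (0, "likes", 1)]
sorted_edges, offsets = to_csr(edges, 3)
# All edges of vertex v are the contiguous slice
# sorted_edges[offsets[v]:offsets[v + 1]], with no scan or filter needed.
print(sorted_edges[offsets[0]:offsets[1]])  # [(0, 'likes', 1), (0, 'likes', 2)]
```

Once the edges are laid out this way, chunking the sorted list preserves the property that a vertex's adjacency spans a small, computable range of chunks.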

Open data lakes for graphs are an exciting idea. Thank you for bringing it up!

adsharma avatar Sep 22 '25 19:09 adsharma

Your point and blog are insightful and interesting! @adsharma

GraphAr enables graph data to flow across different systems (graph databases and analysis systems), so its metadata is designed to record the graph structure and file locations, allowing different systems to find and load only the vertices/edges/properties they need.
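For illustration, GraphAr's per-vertex metadata is a small YAML file along these lines. The field names and structure below are abridged from memory of the GraphAr docs; treat this as a sketch rather than the authoritative schema.

```yaml
# person.vertex.yml -- illustrative GraphAr-style vertex metadata (sketch)
label: person
chunk_size: 1024            # vertices per chunk file
prefix: vertex/person/      # directory where the chunk files live
property_groups:
  - file_type: parquet
    properties:
      - name: id
        data_type: int64
        is_primary: true
      - name: name
        data_type: string
        is_primary: false
```

Because the file records only the layout (labels, chunk sizes, paths, property groups), any system that can read YAML and Parquet can locate and load exactly the chunks it needs.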

"GraphAr" is short for "Graph Archive". It is designed to be immutable and does not support transactions, so GAR's metadata does not need to store change logs or similar information.

This, in my view, is the key difference between GraphAr’s metadata and the metadata used in DuckLake.

yangxk1 avatar Sep 23 '25 12:09 yangxk1

DuckLake v1 didn't support writes. What it does is orthogonal to the mutability of data.

The higher-order bit is to use a SQL catalog (not necessarily DuckDB; DuckLake supports Postgres and MySQL as well) for metadata instead of JSON/YAML.

In the graph world, we have GQL and PGQ. But they're relatively recent and not implemented everywhere. Cypher and its variants are more widespread in the industry.

Building consensus around an acceptable DDL language seems preferable to using yaml/json, regardless of whether the graph is mutable or immutable.

Kuzu/DuckDB are interesting because of the convenience of a single-file embedded DB. But really you could bring any database and its DDL, as long as it doesn't fragment the standard too much.

adsharma avatar Sep 23 '25 19:09 adsharma

I think the DDL is worth discussing. Maybe we can start an issue to discuss it, or I can start an official discussion on a mailing list. @adsharma @SemyonSinchenko

yangxk1 avatar Sep 24 '25 03:09 yangxk1

> The higher order bit is to use SQL catalog (not necessarily duckdb, it supports postgres and mysql as well) for metadata instead of json/yaml.

Within a single system, using a catalog clearly offers advantages.

However, GAR is designed for cross-system interoperability, and YAML is universally readable, even by humans. We cannot require users to adopt any specific catalog implementation (such as DuckLake's or Kuzu's), I am not aware of a common public SQL structure for creating catalog tables, and many systems neither support SQL nor want to pull it in.

We can support catalog-based metadata in systems that provide a catalog (e.g., Kuzu, DuckDB, etc.), while also fully supporting the generic YAML format for import and export to ensure broad interoperability.

yangxk1 avatar Sep 24 '25 08:09 yangxk1

Thank you for starting the discussion. Yes, it's a bit more challenging for graphs because of the heterogeneity of languages and implementations. In the SQL world, the DuckDB folks had an easier time making the case, with implementations backed by three widely used open-source databases.

We'll have to see where things stand in a couple of years with DuckLake to understand how the industry received it.

adsharma avatar Sep 24 '25 15:09 adsharma

Hi @adsharma, I noticed that you submitted the PR https://github.com/LadybugDB/ladybug/pull/3 for storing graphs in Parquet files in LadybugDB. I haven't read the specific implementation yet, and I'm wondering whether it is compatible with GAR.

yangxk1 avatar Oct 17 '25 08:10 yangxk1

@yangxk1 they're semantically compatible but syntactically different. I'm thinking I'll write up a short blog post describing how they ended up diverging and link it here. We can then discuss a path forward.

adsharma avatar Oct 17 '25 16:10 adsharma