GraphScope feat(python): Unify the graph-level load_from and save_to API & Bump up vineyard to v0.21.5

What do these changes do?

Unify the load_from and save_to API to support different kinds of data sources. Related discussion issues: #2836 #2920

`save_to`

def save_to(self, path, format="serialization", **kwargs)

We use save_to to dump a graph to data in a given format. It currently supports graphar and serialization, and is extensible to new formats.

path: output directory; supports local, oss, s3 and hdfs
format: the format to save in; default is 'serialization'
related configurations: writing each format may take format-specific configurations. We name these configurations with the format as a prefix, following the strategy "format_config1", "format_config2". For example, graphar provides the configurations:
- graphar_graph_name
- graphar_vertex_chunk_size
- graphar_edge_chunk_size
- graphar_file_type
- graphar_store_in_local
selector: dumps a subgraph with the selected vertices and edges. Users can define a selector to dump only a certain set of vertices and edges, for example:

selector = {
    "vertices": {
        "person": ["id", "firstName", "secondName"],
        "comment": None,  # select all properties
    },
    "edges": {
        "knows": ["CreationDate"],
        "replyOf": None,
    },
}
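
As a rough illustration of the selector semantics described above, the sketch below resolves a selector against a toy schema, treating `None` as "all properties". The `SCHEMA` dictionary (including the extra labels and properties in it) and the `resolve_selector` helper are hypothetical and exist only for this example; they are not part of the GraphScope API.

```python
# Hypothetical schema for illustration only: label -> list of property names.
SCHEMA = {
    "vertices": {
        "person": ["id", "firstName", "secondName", "birthday"],
        "comment": ["id", "content"],
        "post": ["id", "title"],
    },
    "edges": {
        "knows": ["CreationDate", "weight"],
        "replyOf": ["CreationDate"],
        "likes": ["CreationDate"],
    },
}


def resolve_selector(schema, selector):
    """Expand a selector into explicit property lists for each selected label."""
    resolved = {}
    for kind in ("vertices", "edges"):
        resolved[kind] = {}
        for label, props in selector.get(kind, {}).items():
            if label not in schema[kind]:
                raise ValueError(f"unknown {kind} label: {label!r}")
            available = schema[kind][label]
            if props is None:
                # None means "select all properties" of this label.
                resolved[kind][label] = list(available)
                continue
            missing = [p for p in props if p not in available]
            if missing:
                raise ValueError(f"unknown properties on {label!r}: {missing}")
            resolved[kind][label] = list(props)
    return resolved


selector = {
    "vertices": {
        "person": ["id", "firstName", "secondName"],
        "comment": None,  # select all properties
    },
    "edges": {
        "knows": ["CreationDate"],
        "replyOf": None,
    },
}

resolved = resolve_selector(SCHEMA, selector)
print(resolved)
```

Labels that do not appear in the selector ("post" and "likes" in this toy schema) are left out of the result, which matches the idea of dumping only a subgraph.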

return format

{"type": format, "URI": "format+file:///tmp/graph/xxx"}

Example:

# graphar
g.save_to("/tmp/graphar/", format="graphar", graphar_graph_name="ldbc", graphar_vertex_chunk_size=1024,
          graphar_edge_chunk_size=4096, graphar_file_type="parquet", graphar_store_in_local=False)
{'type': 'graphar', 'URI': 'graphar+file:///tmp/graphar/ldbc.graph.yml'}

# graphar with selector
selector = {
    "vertices": {
        "person": ["id", "firstName", "secondName"],
        "comment": None,  # select all properties
    },
    "edges": {
        "knows": ["CreationDate"],
        "replyOf": None,
    },
}

g.save_to("/tmp/graphar/", format="graphar", selector=selector, graphar_graph_name="ldbc")
{'type': 'graphar', 'URI': 'graphar+file:///tmp/graphar/ldbc.graph.yml'}

# serialization
g.save_to("/tmp/serialization/")
{'type': 'serialization', 'URI': '/tmp/serialization/'}
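
The "format_config" naming strategy for keyword arguments could be handled along the following lines. This is only a minimal sketch under the assumption that format-specific options are recognised by a `<format>_` prefix; `split_format_config` is a hypothetical helper, not a function from GraphScope.

```python
def split_format_config(fmt, **kwargs):
    """Separate keyword arguments prefixed with '<fmt>_' from all other ones.

    Hypothetical helper: returns (format_config, other_kwargs), with the
    prefix stripped from the keys of format_config.
    """
    prefix = fmt + "_"
    format_config = {}
    other = {}
    for key, value in kwargs.items():
        if key.startswith(prefix):
            format_config[key[len(prefix):]] = value
        else:
            other[key] = value
    return format_config, other


config, rest = split_format_config(
    "graphar",
    graphar_graph_name="ldbc",
    graphar_vertex_chunk_size=1024,
    graphar_file_type="parquet",
    selector=None,
)
print(config)
print(rest)
```

With this convention an argument that lacks the prefix (such as a bare `vertex_chunk_size`) would not be treated as a graphar option, so the prefix needs to be spelled out on every format-specific argument.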

`load_from`

def load_from(uri, **kwargs)

We use load_from to load a graph from a data source in a given format. It currently supports graphar and serialization, and is extensible to new data sources.

uri examples

graphar+file:///tmp/graphar/ldbc.graph.yaml
graphar+oss://bucket/graphar/ldbc.graphar.yaml
/tmp/serialization
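
To make the URI convention in these examples concrete, here is a small sketch of how such a URI might be split into a format, a storage scheme and a location. `parse_graph_uri` is a hypothetical helper, and treating a bare path as a local serialization directory is an assumption drawn from the examples in this description, not confirmed behaviour of the implementation.

```python
def parse_graph_uri(uri):
    """Split '<format>+<scheme>://<location>' into its parts.

    Assumption for this sketch: a URI without '://' is a plain path that
    refers to the default 'serialization' format on the local file system.
    """
    if "://" not in uri:
        return {"format": "serialization", "scheme": "file", "location": uri}
    prefix, _, location = uri.partition("://")
    fmt, sep, scheme = prefix.partition("+")
    if not sep:
        # No explicit '<format>+' part: fall back to the default format.
        fmt, scheme = "serialization", prefix
    return {"format": fmt, "scheme": scheme, "location": location}


for example in (
    "graphar+file:///tmp/graphar/ldbc.graph.yaml",
    "graphar+oss://bucket/graphar/ldbc.graphar.yaml",
    "/tmp/serialization",
):
    print(parse_graph_uri(example))
```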

related configurations: loading each format may take format-specific configurations. We name these configurations with the same strategy, "format_config1", "format_config2". For example, graphar provides the configuration storage_options, and can also set:

- graphar_store_in_local # the graphar files are stored in the local file system of the workers; only supported for the local file system.

selector: loads a subgraph with the selected vertices and edges. Users can define a selector to load only a certain set of vertices and edges, for example:

selector = {
    "vertices": {
        "person": None,  # select all properties
        "comment": None,
    },
    "edges": {
        "knows": None,
        "replyOf": None,
    },
}

return: a new Graph

Example:

```python
# graphar
Graph.load_from("graphar+file:///tmp/graphar/ldbc.graph.yml", graphar_store_in_local=True)

# graphar with selector
selector = {
    "vertices": {
        "person": None,  # select all properties
        "comment": None,
    },
    "edges": {
        "knows": None,
        "replyOf": None,
    },
}

Graph.load_from("graphar+file:///tmp/graphar/ldbc.graph.yml", selector=selector)

# serialization
g.load_from("/tmp/serialization/")
```

Related issue number

Fixes #2836 Fixes #2920

Mar 06 '24 01:03 acezen

Codecov Report

Attention: Patch coverage is 89.03226%, with 17 lines in your changes missing coverage. Please review.

Project coverage is 42.96%. Comparing base (d46a354) to head (0971f2b). Report is 12 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3610       +/-   ##
===========================================
+ Coverage   27.76%   42.96%   +15.19%     
===========================================
  Files         178      188       +10     
  Lines       16245    17540     +1295     
===========================================
+ Hits         4511     7536     +3025     
+ Misses      11734    10004     -1730

| Files | Coverage Δ | |
|---|---|---|
| python/graphscope/__init__.py | `82.85% <ø> (-0.48%)` | :arrow_down: |
| python/graphscope/client/session.py | `68.25% <ø> (+9.00%)` | :arrow_up: |
| python/graphscope/framework/dag_utils.py | `65.31% <100.00%> (+22.94%)` | :arrow_up: |
| python/graphscope/framework/graph_builder.py | `87.17% <ø> (+24.67%)` | :arrow_up: |
| python/graphscope/tests/unittest/test_graphar.py | `100.00% <100.00%> (ø)` | |
| python/graphscope/framework/graph.py | `82.87% <76.05%> (+17.68%)` | :arrow_up: |

... and 89 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bca8b74...0971f2b.

Mar 06 '24 02:03 codecov-commenter

https://github.com/alibaba/GraphScope/actions/runs/8720836297/job/23936695839?pr=3610 The k8s-test job is always cancelled after ~44m. Can someone help me with the CI? @siyuan0322 @dashanji @lidongze0629

Apr 18 '24 02:04 acezen

@siyuan0322 Since the manylinux2014-ci-test tag workflow has passed in https://github.com/alibaba/GraphScope/actions/runs/8733355972 , I have updated the manylinux2014 tag runner. FYI

Apr 18 '24 08:04 acezen