
feat(python): Unify the graph level load_from and save to API & Bump up vineyard to v0.21.5

Open acezen opened this issue 1 year ago • 1 comments

What do these changes do?

Unify the load_from and save_to APIs so that they support different kinds of data sources. Related discussion issues: #2836, #2920.

save_to

def save_to(self, path, format="serialization", **kwargs)

We use save_to to dump a graph to data in a given format. It currently supports graphar and serialization, and it is extensible so that new formats can be added.

  • path: the output directory; local, OSS, S3, and HDFS locations are supported

  • format: the format to save in; the default is 'serialization'

  • related configurations: writing each format may require format-specific configurations. These are passed as keyword arguments named with the format as a prefix, following the pattern "format_config1", "format_config2". For example, graphar provides these configurations:

    • graphar_graph_name
    • graphar_vertex_chunk_size
    • graphar_edge_chunk_size
    • graphar_file_type
    • graphar_store_in_local
  • selector: dumps a subgraph that contains only the selected vertices and edges. The user can define a selector to dump only a certain set of vertices and edges, for example:

selector = {
    "vertices": {
           "person": ["id", "firstName", "secondName"],
           "comment": None,  # select all properties
     },
    "edges": {
           "knows": ["CreationDate"],
           "replyOf": None,
    },
}
  • return value: a dict that holds the format and the URI of the saved graph:
{"type": format, "URI": "format+file:///tmp/graph/xxx"}
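The "format_config" naming strategy can be illustrated with a small sketch. The helper `split_format_options` below is hypothetical and not part of GraphScope; it only shows how keyword arguments that carry the format name as a prefix could be separated from the remaining arguments.

```python
def split_format_options(fmt, **kwargs):
    """Separate kwargs that start with '<fmt>_' from all other kwargs.

    Hypothetical helper for illustration; not part of the GraphScope API.
    """
    prefix = fmt + "_"
    format_options = {}
    other_options = {}
    for key, value in kwargs.items():
        if key.startswith(prefix):
            format_options[key] = value
        else:
            other_options[key] = value
    return format_options, other_options


# Group the arguments of a save_to-style call by prefix.
format_options, other_options = split_format_options(
    "graphar",
    graphar_graph_name="ldbc",
    graphar_file_type="parquet",
    selector=None,
)
print(format_options)
print(other_options)
```

With this convention, an argument that lacks the prefix ends up in the second dict and would not be treated as a graphar option.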

Example:

# graphar
g.save_to("/tmp/graphar/", format="graphar", graphar_graph_name="ldbc", graphar_vertex_chunk_size=1024,
                  graphar_edge_chunk_size=4096, graphar_file_type="parquet", graphar_store_in_local=False)
{'type': 'graphar', 'URI': 'graphar+file:///tmp/graphar/ldbc.graph.yml'}

# graphar with selector
selector = {
    "vertices": {
           "person": ["id", "firstName", "secondName"],
           "comment": None,  # select all properties
     },
    "edges": {
           "knows": ["CreationDate"],
           "replyOf": None,
    },
}

g.save_to("/tmp/graphar/", format="graphar", selector=selector, graphar_graph_name="ldbc")
{'type': 'graphar', 'URI': 'graphar+file:///tmp/graphar/ldbc.graph.yml'}

# serialization
g.save_to("/tmp/serialization/")
{'type':'serialization', 'URI': '/tmp/serialization/'}
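The selector can be read as a per-label property filter: a list keeps only the named properties, and None keeps all of them. The following is a minimal sketch of that reading. `apply_selector` and the schema are made up for illustration and are not part of GraphScope; whether labels absent from the selector are dropped is an assumption here.

```python
def apply_selector(schema, selector):
    """Return the labels and properties that a selector keeps.

    `schema` maps "vertices"/"edges" to {label: [property, ...]}.
    A selector value of None keeps every property of that label.
    Labels missing from the selector are dropped (an assumption).
    Hypothetical helper for illustration; not part of the GraphScope API.
    """
    selected = {}
    for kind in ("vertices", "edges"):
        selected[kind] = {}
        for label, wanted in selector.get(kind, {}).items():
            available = schema[kind][label]
            if wanted is None:
                selected[kind][label] = list(available)
            else:
                missing = [p for p in wanted if p not in available]
                if missing:
                    raise ValueError(f"unknown properties for {label}: {missing}")
                selected[kind][label] = list(wanted)
    return selected


# Made-up schema, used only to exercise the helper.
schema = {
    "vertices": {
        "person": ["id", "firstName", "secondName", "birthday"],
        "comment": ["id", "content"],
        "post": ["id", "title"],
    },
    "edges": {
        "knows": ["CreationDate"],
        "replyOf": [],
    },
}
selector = {
    "vertices": {
        "person": ["id", "firstName", "secondName"],
        "comment": None,  # select all properties
    },
    "edges": {
        "knows": ["CreationDate"],
        "replyOf": None,
    },
}
selected = apply_selector(schema, selector)
print(selected)
```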

load_from

def load_from(uri, **kwargs)

We use load_from to load a graph from a data source in a given format. It currently supports graphar and serialization, and it is extensible so that new data sources can be added.

  • uri examples
graphar+file:///tmp/graphar/ldbc.graph.yaml
graphar+oss://bucket/graphar/ldbc.graphar.yaml
/tmp/serialization
  • related configurations: loading each format may require format-specific configurations, named with the same "format_config1", "format_config2" strategy. For example, graphar provides the configuration storage_options. graphar can also set:
    • graphar_store_in_local: the graphar files are stored in the local file system of the workers; only supported for the local file system.
  • selector: loads a subgraph that contains only the selected vertices and edges. The user can define a selector to load only a certain set of vertices and edges, for example:
selector = {
    "vertices": {
           "person": None,  # select all properties
           "comment": None,
     },
    "edges": {
           "knows": None,
           "replyOf": None,
    },
}
  • return value: a new Graph
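The example URIs combine a format and a storage scheme in the form "format+scheme://...". As a rough sketch of how such a URI could be taken apart with the standard library, consider the hypothetical helper `parse_graph_uri` below. It is not part of GraphScope, and treating a plain path as local 'serialization' data is an assumption based on the third example URI.

```python
from urllib.parse import urlsplit


def parse_graph_uri(uri):
    """Split a URI such as 'graphar+file:///tmp/g.yaml' into its parts.

    Returns (format, storage scheme, bucket or host, path).
    A plain path without a scheme is treated as 'serialization' data on
    local storage (an assumption).
    Hypothetical helper for illustration; not part of the GraphScope API.
    """
    parts = urlsplit(uri)
    if not parts.scheme:
        return "serialization", "file", "", parts.path
    fmt, sep, storage = parts.scheme.partition("+")
    if not sep:
        raise ValueError(f"expected 'format+scheme://...', got {uri!r}")
    return fmt, storage, parts.netloc, parts.path


for uri in (
    "graphar+file:///tmp/graphar/ldbc.graph.yaml",
    "graphar+oss://bucket/graphar/ldbc.graphar.yaml",
    "/tmp/serialization",
):
    print(parse_graph_uri(uri))
```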

Example:

```python
# graphar
Graph.load_from("graphar+file:///tmp/graphar/ldbc.graph.yml", graphar_store_in_local=True)

# graphar with selector
selector = {
    "vertices": {
        "person": None,  # select all properties
        "comment": None,
    },
    "edges": {
        "knows": None,
        "replyOf": None,
    },
}

Graph.load_from("graphar+file:///tmp/graphar/ldbc.graph.yml", selector=selector)

# serialization
g.load_from("/tmp/serialization/")
```

Related issue number

Fixes #2836 Fixes #2920

acezen avatar Mar 06 '24 01:03 acezen

Codecov Report

Attention: Patch coverage is 89.03226%, with 17 lines in your changes missing coverage. Please review.

Project coverage is 42.96%. Comparing base (d46a354) to head (0971f2b). Report is 12 commits behind head on main.

Additional details and impacted files


@@             Coverage Diff             @@
##             main    #3610       +/-   ##
===========================================
+ Coverage   27.76%   42.96%   +15.19%     
===========================================
  Files         178      188       +10     
  Lines       16245    17540     +1295     
===========================================
+ Hits         4511     7536     +3025     
+ Misses      11734    10004     -1730     
Files                                               Coverage Δ
python/graphscope/__init__.py                       82.85% <ø> (-0.48%)
python/graphscope/client/session.py                 68.25% <ø> (+9.00%)
python/graphscope/framework/dag_utils.py            65.31% <100.00%> (+22.94%)
python/graphscope/framework/graph_builder.py        87.17% <ø> (+24.67%)
python/graphscope/tests/unittest/test_graphar.py    100.00% <100.00%> (ø)
python/graphscope/framework/graph.py                82.87% <76.05%> (+17.68%)

... and 89 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bca8b74...0971f2b.

codecov-commenter avatar Mar 06 '24 02:03 codecov-commenter

https://github.com/alibaba/GraphScope/actions/runs/8720836297/job/23936695839?pr=3610 The k8s-test job is always cancelled after ~44 min. Can someone help me with the CI? @siyuan0322 @dashanji @lidongze0629

acezen avatar Apr 18 '24 02:04 acezen

@siyuan0322 Since the manylinux2014-ci-test tag workflow has passed in https://github.com/alibaba/GraphScope/actions/runs/8733355972 , I have updated the manylinux2014 tag runner. FYI.

acezen avatar Apr 18 '24 08:04 acezen