arrow ARROW-16855: [C++] Adding Read Relation ToProto

This is the initial PR to set the util functions and structure to include the ToProto functionality to relations. Here the objective is to create an ACERO relation by interpretting what is included in a Substrait-Relation. In this PR the read relation ToProto is added.

Jun 20 '22 01:06 vibhatha

https://issues.apache.org/jira/browse/ARROW-16855

Jun 20 '22 01:06 github-actions[bot]

:warning: Ticket has no components in JIRA, make sure you assign one.

Jun 20 '22 01:06 github-actions[bot]

cc @westonpace added the initial PR to integrate ToProto for relations. The detailed task breakdown for ToProto is documented in here: https://issues.apache.org/jira/browse/ARROW-16854

The idea is to add part by part in smaller PRs.

Jun 20 '22 01:06 vibhatha

I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

Jun 20 '22 07:06 westonpace

I'm out Monday & Tuesday. Maybe @jvanstraten can take a look? Otherwise I can get to this on Wednesday

Wednesday works for me 👍

Jun 20 '22 07:06 vibhatha

I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

Jun 20 '22 09:06 jvanstraten

I don't feel qualified to comment on those design questions, but FWIW, I ran the serialized output of the test case through the validator and it's okay-ish (the validator doesn't like the lack of a NULLABILITY_REQUIRED in the struct that represents the schema, but that's pretty pedantic I guess), and the code looks fine to me.

Thanks a lot for the quick check on this. It’s very interesting how you validated things using the tool. Do you think it’s wise to add a CI to test Substrait related queries using this tool?

Please feel free to add suggestions. @jvanstraten

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

cc @westonpace For future reference in the review.

Jun 20 '22 09:06 vibhatha

Do you think it’s wise to add a CI to test Substrait related queries using this tool?

IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).

I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.

Link, just in case: https://github.com/substrait-io/substrait-validator

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

Jun 20 '22 10:06 jvanstraten

Do you think it’s wise to add a CI to test Substrait related queries using this tool?

IMO every roundtripped plan in every Substrait consumer and/or producer should also be passed through the validator. Otherwise, how would you know for sure that the Substrait plan you've successfully roundtripped through is actually sensible in any way? It does always require a complete plan, though, so you'd need some or other function for each type of thing (expression, relation, etc) that surrounds the thing with a dummy plan. Arrow could hook into it via the C interface (it's not a very pleasant interface because it's intended to be compatible with any language that can call into C, so you might want to wrap it with some C++ stuff; also it will need a Rust compiler to build) or it could just execute the CLI on a generated file (more clunky, but that can just be pulled from PyPI in binary form, so it's probably a bit easier on CI).

I'm sure I'm biased though, since I'm the one who made the validator. It's also starting to considerably lag behind Substrait; it doesn't seem like anyone is sufficiently interested to review/collaborate, so I can't get any PRs through.

Link, just in case: https://github.com/substrait-io/substrait-validator

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

One doubtful thing is to check in serialization is whether a projection or filter expression is added or not/ differentiation from default values. For instance filter expression defaults to a boolean literal of value true.

Assuming you mean that in Acero the filter expression is mandatory and is just set to literal true if there is none, IMO you could just do the same thing on the Substrait side, at least for now. Likewise for the projection. Or you could just leave it for a later PR and error out when presented with nontrivial values. I don't know how hard any of these things are; I've never done anything with the Acero representation.

Here it is rather, the differentiation between a user passed value vs the default. We could assume the default and do the comparison to see if an explicit value is passed. There is no API calls in Expression to check if it has_filter or has_projection. May be that kind of a function could be useful.

Jun 20 '22 14:06 vibhatha

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

Jul 06 '22 00:07 westonpace

Intersting thoughts. I will take a look at the tool. It would be better if we can use it to validate things. But I am not sure if it needs to be inside the Arrow source or should it be a plugin for Apache Arrow. cc @westonpace

I think it would make a lot of sense for unit tests to bring in the validator as a C dependency.

Should we create a JIRA for this?

Jul 06 '22 02:07 vibhatha

A few more thoughts. Using parquet is fine. My only concern was the test data directory.

We got rid of that. By the way this could would remain very much same, but ths usage would be different once this is subjected to a registry usage in Substrait ToProto methods.

Jul 12 '22 00:07 vibhatha

** THIS PR IS UNDERGOING A REFACTOR **

Jul 13 '22 07:07 vibhatha

Thanks for this work! It'll be very useful in testing to have roundtrippable full plans. I do have some concerns wrt the large-scale structure of this patch, though:

Maybe I've missed something, but I don't see the necessity of wrapping this in a registry. We have a number of registries in arrow but it's a pattern with heavy overhead which we follow only when required by constraints of third party extensibility. In this case, since the protobuf message classes are a private/internal implementation detail, adding anything to this registry would require rebuilding arrow anyway. That being the case, please remove the registry and ensure that protobuf message classes remain internal.

Thank you @bkietz for the detailed review. I was inclined towards the registry with the idea of putting the Substrait-to-Acero and Acero-to-Substrait to be unified under a single registry. Not sure if it is a good idea. And this idea is not reflected in this PR.

We can easily move for a non-registry approach it is no issue.

cc @westonpace for more thoughts on this.

Aug 04 '22 16:08 vibhatha

I think the goal for a registry was to support conversion of custom exec nodes to Substrait. I agree that we don't want to expose Substrait internal classes though. @bkietz any ideas? This probably isn't the biggest priority at the moment if we can't find a way to support it.

Aug 04 '22 23:08 westonpace

Since this API is designed to be a test-only utility I wouldn't expect us to need to deal with custom exec nodes

Aug 08 '22 20:08 bkietz

Since this API is designed to be a test-only utility I wouldn't expect us to need to deal with custom exec nodes

@bkietz

That's a reasonable stance. I suppose I had hoped we could provide some utility to users with custom relations. I suppose the internal-ness of Substrait makes that pretty difficult however. Custom relations in Substrait use the any message type in protobuf. Is it legal to expose google/protobuf/any.pb.h?

If yes, then it seems we could still have a registry (or at least a fallback lambda) for custom extensions.

If no, then won't this be a concern for the other direction, e.g. deserializing Substrait plans with custom extensions?

Aug 08 '22 22:08 westonpace

I think this PR provides useful functionality for round trip testing. I also agree that we want to support custom relations (which will in all likelihood require a registry). I think trying to address both goals with the same API will compromise its efficacy for either.

Re extension relations: I think google.protobuf.Any is simple enough that we don't really need to expose it since it is just a buffer and a type_url. I'd propose the following structure:

class ExtensionRelationRegistry {
 public:
  using Factory = std::function<Result<Declaration>(const Buffer& bytes, std::vector<Declaration> inputs)>;

  enum RelationType {
    /// substrait.ExtensionTable: the relation is a literal and requires no processing to emit data
    kTable,
    /// substrait.ExtensionLeafRel: the relation has no inputs, for example may stream data from disk
    kLeaf,
    /// substrait.ExtensionSingleRel: the relation has a single input, for example may filter out rows
    kSingle,
    /// substrait.ExtensionMultiRel: the relation has multiple inputs, for example may be a HighFolutinJoin
    kMulti,
  };

  virtual Status AddFactory(std::string type_url, RelationType, Factory) = 0;
  virtual Result<Factory> GetFactory(const std::string& type_url, RelationType) = 0;
};

When adding the above registry, we can extend ToProto with a lambda fallback like struct { RelationType type; std::string type_url; std::shared_ptr<Buffer> bytes; } custom_to_proto(const Declaration&) which should be sufficient for round trip testing purposes.

Aug 09 '22 13:08 bkietz

@westonpace Re: https://github.com/apache/arrow/pull/13401#discussion_r956480272

The URL is like /var/<some_path>/T//data.parquet

Aug 29 '22 08:08 vibhatha

@westonpace Re: #13401 (comment)

The URL is like /var/<some_path>/T//data.parquet

@westonpace tracking down the issue

Tempdir path : /var/folders/vq/2v5vdv2n1t9b_fp4lp17h3d40000gn/T//3mpj4aud/
File Path : /var/folders/vq/2v5vdv2n1t9b_fp4lp17h3d40000gn/T//3mpj4aud/serde_test.parquet
/Users/vibhatha/github/fork/arrow/cpp/src/arrow/filesystem/path_util.cc:104:  Check failed: !stem.empty() 
0   libarrow.1000.0.0.dylib             0x0000000111b06aa0 _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.1000.0.0.dylib             0x0000000111b06a54 _ZN5arrow4util7CerrLogD2Ev + 96
2   libarrow.1000.0.0.dylib             0x0000000111b069b4 _ZN5arrow4util7CerrLogD1Ev + 28
3   libarrow.1000.0.0.dylib             0x0000000111b069e0 _ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.1000.0.0.dylib             0x0000000111b068b0 _ZN5arrow4util8ArrowLogD2Ev + 84
5   libarrow.1000.0.0.dylib             0x0000000111b068ec _ZN5arrow4util8ArrowLogD1Ev + 28

When the tempdir is created, that path contains this /T// component in the URI. And this could be a MAC thing as I am developing on Mac M1. It is better to file a JIRA for this IMHO.

Should we fix it first and then fix this PR? Or put a macro with the ticket id to resolve it?

Edit:

PS: Trace it to here: https://github.com/apache/arrow/blob/359eab59bb12087dbddbdb1dd59ebf48bef42ab6/cpp/src/arrow/util/io_util.cc#L1870

I think the issue is there is a trailing slash guaranteed by Arrow filesystem and that is guaranteed for the generated temporary directory for the convenience. By reading through the test cases that is what I was able to gather. But somehow in Mac M1 where I compile my code that trailing space is implicitly added to the code. Or there is a wrong append that I am doing, but it is not the case as I am using Join to do a safe path concat (I am looking futher).

I think I understand what to be fixed or at least where to look. Feels like this is part of file_test.cc and localfs_test.cc tests. Without the trailing slash, these test cases fail. But with trailing slash this test case fails.

WDYT?

Aug 31 '22 01:08 vibhatha

@westonpace I added a fix for the path issue on Mac. I think now it is more generalized.

Any other suggestions?

Sep 07 '22 15:09 vibhatha

@westonpace I updated the PR. Seems like a few CIs are failing. But, it seems like not related to the changes applied here. WDYT?

Sep 08 '22 11:09 vibhatha

@westonpace Thanks a lot for keeping up with the major changes and a few rounds of reviews. 👍

Sep 08 '22 16:09 vibhatha

Benchmark runs are scheduled for baseline = 8fe7e35388a8147527037711e4262981fa81644a and contender = 74756051c4f6a8b13a40057f586817d56198d4ba. 74756051c4f6a8b13a40057f586817d56198d4ba is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Finished :arrow_down:0.31% :arrow_up:0.2%] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.46% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 74756051 ec2-t3-xlarge-us-east-2 [Finished] 74756051 test-mac-arm [Failed] 74756051 ursa-i9-9960x [Finished] 74756051 ursa-thinkcentre-m75q [Finished] 8fe7e353 ec2-t3-xlarge-us-east-2 [Finished] 8fe7e353 test-mac-arm [Failed] 8fe7e353 ursa-i9-9960x [Finished] 8fe7e353 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sep 08 '22 21:09 ursabot