PARQUET-1711: Break circular dependencies in proto definitions
In case some proto definitions have circular dependencies, the proto schema converter breaks those and logs a warning, instead of a StackOverflowException.
Jira
- [x] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-1711
Tests
- [x] My PR adds the following unit tests OR does not need testing for this extremely good reason:
- Proto definitions with circular dependencies tested in
ProtoSchemaConverterTest
- Proto definitions with circular dependencies tested in
Commits
- [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters (not including Jira issue reference)
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does
@shangxinli Let me know if this is good to merge!
hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields.
i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head.
ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head:
schema converter pr:
- https://github.com/promotedai/parquet-mr/pull/1 write support pr:
- https://github.com/promotedai/parquet-mr/pull/2
fyi, i sent pr #995
@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.
@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.
imho, i believe #995 is a superset of functionality to this pr.
Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate https://github.com/apache/parquet-mr/pull/995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.
Does it make sense to both of you?
Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate #995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.
Does it make sense to both of you?
i don't think merging will help here. both approaches do similar things in terms of traversing and expanding out the schema on recursive fields. the differ on the state used during the traversal, and they differ on how to deal with the remaining recursive data (this one silently ignores, but the mine stores as serialized bytes).
i don't care about authorship. i want this to get fixed, and fixed properly.
Hi, I am fine with whatever solution. If you choose #995 that works, please just close this one!
Since https://github.com/apache/parquet-mr/pull/995 is merged, let's close this one. Thanks @matthieun for the contribution !