parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-1711: Break circular dependencies in proto definitions

Open matthieun opened this issue 3 years ago • 8 comments

In case some proto definitions have circular dependencies, the proto schema converter breaks those and logs a warning, instead of a StackOverflowException.

Jira

  • [x] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
    • https://issues.apache.org/jira/browse/PARQUET-1711

Tests

  • [x] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • Proto definitions with circular dependencies tested in ProtoSchemaConverterTest

Commits

  • [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

matthieun avatar Aug 18 '22 20:08 matthieun

@shangxinli Let me know if this is good to merge!

matthieun avatar Aug 26 '22 23:08 matthieun

hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields.

i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head.

ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head:

schema converter pr:

  • https://github.com/promotedai/parquet-mr/pull/1 write support pr:
  • https://github.com/promotedai/parquet-mr/pull/2

jinyius avatar Aug 31 '22 05:08 jinyius

fyi, i sent pr #995

jinyius avatar Sep 08 '22 07:09 jinyius

@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.

shangxinli avatar Sep 27 '22 02:09 shangxinli

@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.

imho, i believe #995 is a superset of functionality to this pr.

jinyius avatar Sep 28 '22 05:09 jinyius

Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate https://github.com/apache/parquet-mr/pull/995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.

Does it make sense to both of you?

shangxinli avatar Oct 09 '22 20:10 shangxinli

Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate #995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.

Does it make sense to both of you?

i don't think merging will help here. both approaches do similar things in terms of traversing and expanding out the schema on recursive fields. the differ on the state used during the traversal, and they differ on how to deal with the remaining recursive data (this one silently ignores, but the mine stores as serialized bytes).

i don't care about authorship. i want this to get fixed, and fixed properly.

jinyius avatar Oct 10 '22 04:10 jinyius

Hi, I am fine with whatever solution. If you choose #995 that works, please just close this one!

matthieun avatar Oct 10 '22 17:10 matthieun

Since https://github.com/apache/parquet-mr/pull/995 is merged, let's close this one. Thanks @matthieun for the contribution !

shangxinli avatar Dec 03 '22 18:12 shangxinli