Add example binary variant data and regeneration scripts

Open alamb opened this issue 9 months ago • 1 comment

  • Closes https://github.com/apache/parquet-testing/issues/75
  • Related to https://github.com/apache/arrow-rs/pull/7404

Rationale

Per the parquet mailing list and the issue https://github.com/apache/parquet-testing/issues/75, it seems that Spark is currently the only open source implementation of Variant available. All tests I could find in the Spark codebase exercise the code by round-tripping to JSON rather than using well-known binary examples.

To facilitate implementations in other languages and systems (such as Rust in arrow-rs) we need binary artifacts to ensure compatibility.

Changes

This PR adds

  1. example binary variant data, for primitive as well as short_string, object and array types
  2. The script used to generate the data (a sketch of the general approach is shown below)
  3. Documentation
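
For reference, here is a minimal sketch of the general shape of the generation approach, assuming a Spark 4.x build with VARIANT support; the output path and example JSON are illustrative, and the actual script is included in this PR:

```python
# A sketch of generating Variant binary data with Spark, assuming a
# Spark 4.x build with VARIANT support. parse_json is the Spark SQL
# function that parses a JSON string into a VARIANT value; the output
# path here is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-examples").getOrCreate()

# Round-trip one JSON example into the Variant binary encoding.
df = spark.sql("""
    SELECT parse_json('{"a": 1, "b": ["x", "y"]}') AS var
""")

# Spark writes the variant column as a Parquet group with two binary
# fields, `metadata` and `value` -- those bytes are the artifacts
# this PR checks in.
df.write.mode("overwrite").parquet("/tmp/variant_examples.parquet")
```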

TODO:

  • [ ] Manually verify binary encodings
  • [ ] File follow-on tickets for creating variants from data that is not JSON-encodable (e.g., with timestamp fields)
  • [ ] File a follow-on ticket for creating larger objects / arrays that require different offset lengths

alamb avatar Apr 14 '25 11:04 alamb

I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them), and they look as expected.

If this format is acceptable, I will double-check all the values manually.

alamb avatar Apr 16 '25 14:04 alamb

🦗

alamb avatar Apr 28 '25 15:04 alamb

Today at the Parquet sync, @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you could suggest some other people who might be able to review it.

alamb avatar Apr 30 '25 17:04 alamb

Thank you @emkornfield -- I will address your comments shortly and manually review the binary values

alamb avatar Apr 30 '25 19:04 alamb

I manually reviewed the binary encodings for primitive types and they match VariantEncoding.md as far as I can tell.
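
For anyone checking along, here is a rough sketch (my own, not part of the PR) of how a primitive value header decodes per VariantEncoding.md:

```python
# Per VariantEncoding.md, the basic type sits in the two low bits of the
# first byte of a value, and for primitives (basic_type == 0) the type ID
# sits in the upper six bits.
def decode_primitive_header(first_byte: int) -> tuple[int, int]:
    basic_type = first_byte & 0b11
    type_id = first_byte >> 2
    return basic_type, type_id

# e.g. an int8 (type ID 3) value 34 encodes as b"\x0c\x22":
# header (3 << 2) | 0, then the single value byte
assert decode_primitive_header(0x0C) == (0, 3)
```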

I am actually having trouble manually verifying the nested object metadata; I will continue to investigate.
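
For context, this is the metadata layout I am checking against -- a rough decoding sketch based on my reading of VariantEncoding.md, with illustrative example bytes:

```python
# Rough sketch of decoding Variant metadata per VariantEncoding.md.
# Layout: header byte (version in bits 0-3, sorted flag in bit 4,
# offset_size_minus_one in bits 6-7), then dictionary_size, then
# dictionary_size + 1 offsets, then the concatenated UTF-8 key bytes.
def decode_metadata(buf: bytes) -> list[str]:
    header = buf[0]
    assert header & 0x0F == 1, "only version 1 is defined"
    offset_size = ((header >> 6) & 0x03) + 1  # 1..4 bytes per offset

    def read_int(pos: int) -> int:
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dict_size = read_int(1)
    offsets_start = 1 + offset_size
    offsets = [read_int(offsets_start + i * offset_size)
               for i in range(dict_size + 1)]
    bytes_start = offsets_start + (dict_size + 1) * offset_size
    return [buf[bytes_start + offsets[i]:bytes_start + offsets[i + 1]].decode("utf-8")
            for i in range(dict_size)]

# e.g. metadata for an object with the single key "a":
assert decode_metadata(b"\x01\x01\x00\x01a") == ["a"]
```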

I did verify that using pyspark built from main as of today still generates the same variant binary values

alamb avatar May 02 '25 15:05 alamb

Thank you for the review @RussellSpitzer

alamb avatar May 02 '25 21:05 alamb

LGTM. Thank you @alamb for taking the initiative in driving this forward.

emkornfield avatar May 03 '25 06:05 emkornfield

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}
  2. Null object metadata is empty -- is this expected?

mapleFU avatar May 12 '25 08:05 mapleFU

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}

I tried to follow the naming in the table from VariantEncoding.md, which uses those terms:

| Equivalence Class | Variant Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|---|---|---|---|---|
| Exact Numeric | decimal4 | 8 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | 9 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | 10 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
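
As a worked illustration of the decimal4 row above (my own sketch, not from the spec):

```python
# decimal4 layout from the table above: header byte
# (type ID 8 << 2 | basic_type 0), one scale byte, then a 4-byte
# little-endian unscaled value. DECIMAL 12.34 = unscaled 1234, scale 2.
import struct

def encode_decimal4(unscaled: int, scale: int) -> bytes:
    header = (8 << 2) | 0  # primitive basic type, type ID 8 (decimal4)
    return bytes([header, scale]) + struct.pack("<i", unscaled)

assert encode_decimal4(1234, 2) == b"\x20\x02\xd2\x04\x00\x00"
```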
  2. Null object metadata is empty -- is this expected?

This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:

  • https://github.com/apache/parquet-testing/issues/81

alamb avatar May 12 '25 10:05 alamb