Add example binary variant data and regeneration scripts
- Closes https://github.com/apache/parquet-testing/issues/75
- Related to https://github.com/apache/arrow-rs/pull/7404
Rationale
Per the parquet mailing list and the issue https://github.com/apache/parquet-testing/issues/75, it seems that Spark is currently the only open source implementation of Variant available. All tests I could find in the Spark codebase exercise the code by round-tripping to JSON rather than checking against well-known binary examples.
To facilitate implementations in other languages and systems (such as Rust in arrow-rs) we need binary artifacts to ensure compatibility.
Changes
This PR adds
- example binary variant data for primitive, short_string, object, and array types
- The script used to generate the data (a sketch of the general approach follows this list)
- Documentation
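For orientation, here is a minimal sketch of the general approach such a generation script can take. This is illustrative only, not the actual script in this PR: the case names, JSON literals, and output path are hypothetical, and it assumes Spark 4.x, where `pyspark.sql.functions.parse_json` produces Variant values.

```python
# Illustrative sketch only (not the script in this PR). Assumes Spark 4.x,
# where pyspark.sql.functions.parse_json builds Variant values from JSON text.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, parse_json

spark = SparkSession.builder.appName("variant-examples").getOrCreate()

# One named case per example; the JSON literals here are hypothetical.
cases = [
    ("primitive_int", "34"),
    ("short_string", '"hello"'),
    ("object_primitive", '{"a": 1, "b": "str"}'),
    ("array_primitive", "[1, 2, 3]"),
]
df = (
    spark.createDataFrame(cases, ["case", "json"])
    .select("case", parse_json(col("json")).alias("var"))
)

# Each Variant is stored as a (metadata, value) pair of binary buffers.
df.write.mode("overwrite").parquet("/tmp/variant-examples")
```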
TODO:
- [ ] Manually verify binary encodings
- [ ] File follow-on tickets for creating variants from data that is not JSON-encodable (e.g. with timestamp fields)
- [ ] File follow-on ticket for creating larger objects / arrays that require different offset lengths
I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them) and they look as expected.
If this format is acceptable, I will double-check all the values manually.
🦗
Today at the Parquet sync @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you can suggest other people who might be able to review it.
Thank you @emkornfield -- I will address your comments shortly and manually review the binary values
I manually reviewed the binary encodings for primitive types and they match VariantEncoding.md as far as I can tell.
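As an illustration of the kind of spot check involved, here is a minimal sketch based on my reading of VariantEncoding.md (not code from this PR): the first byte of a value packs the basic type into the low 2 bits and a type-specific header into the high 6 bits.

```python
# Sketch of a header-byte check (my reading of VariantEncoding.md).
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}

def describe_value_header(first_byte: int) -> str:
    basic_type = first_byte & 0b11   # low 2 bits: basic type
    value_header = first_byte >> 2   # high 6 bits: type id / length / field info
    return f"basic_type={BASIC_TYPES[basic_type]}, value_header={value_header}"

# 0x0C -> primitive with type id 3, which the spec's table lists as int8.
print(describe_value_header(0x0C))
```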
I am actually having trouble manually verifying the nested object metadata; I will continue to investigate.
I did verify that pyspark built from main as of today still generates the same variant binary values.
Thank you for the review @RussellSpitzer
LGTM. Thank you @alamb for taking the initiative in driving this forward.
@alamb I noticed that:
- decimal is named {4|8|16}, not {32|64|128}
- Null object metadata is empty; is this expected?
> @alamb I noticed that:
> - decimal is named {4|8|16}, not {32|64|128}
I tried to follow the naming in the table from VariantEncoding.md, which uses these terms:
| Equivalence Class | Variant Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|---|---|---|---|---|
| Exact Numeric | decimal4 | 8 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | 9 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | 10 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
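To make the decimal4 row concrete, here is a worked sketch of how 12.34 would encode under this layout. This is my own example for illustration, not one of the files in this PR:

```python
# Hypothetical worked example of the decimal4 layout described above.
import struct

type_id = 8                          # decimal4 per the table
header = (type_id << 2) | 0          # basic_type 0 (primitive) in the low 2 bits
scale = 2                            # 12.34 == 1234 * 10**-2
unscaled = struct.pack("<i", 1234)   # 4-byte little-endian unscaled value

encoded = bytes([header, scale]) + unscaled
print(encoded.hex())                 # 2002d2040000
```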
> - Null object metadata is empty; is this expected?
This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:
- https://github.com/apache/parquet-testing/issues/81
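For anyone else double-checking this, a minimal sketch of my reading of the metadata layout in VariantEncoding.md (not code from this PR): the smallest valid metadata is a version-1 header byte, a zero dictionary size, and a single zero offset, so a truly zero-length metadata buffer would be suspect.

```python
# Sketch (my reading of VariantEncoding.md, not code from this PR).
def decode_metadata_header(first_byte: int):
    version = first_byte & 0x0F                    # low 4 bits, must be 1
    sorted_strings = bool(first_byte & 0x10)       # bit 4
    offset_size = ((first_byte >> 6) & 0x03) + 1   # high 2 bits store size - 1
    return version, sorted_strings, offset_size

# b"\x01\x00\x00" would be the minimal metadata: version 1, unsorted,
# 1-byte offsets, empty dictionary (size 0 plus one trailing 0 offset).
print(decode_metadata_header(0x01))  # (1, False, 1)
```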