Add example binary variant data and regeneration scripts

Open alamb opened this issue 9 months ago • 1 comment

  • Closes https://github.com/apache/parquet-testing/issues/75
  • Related to https://github.com/apache/arrow-rs/pull/7404

Rationale

Per the parquet mailing list and the issue https://github.com/apache/parquet-testing/issues/75, it seems that Spark is currently the only open source implementation of Variant available. All tests I could find in the Spark codebase exercise the code by round-tripping to JSON rather than using well-known binary examples.

To facilitate implementations in other languages and systems (such as Rust in arrow-rs) we need binary artifacts to ensure compatibility.

Changes

This PR adds

  1. example binary variant data, for primitive as well as short_string, object and array types
  2. The script used to generate the data (a sketch of the general approach is shown below)
  3. Documentation
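
For reference, here is a minimal sketch of the general shape of the generation approach, assuming a Spark 4.x build with VARIANT support; the output path and example JSON are illustrative, and the actual script is included in this PR:

```python
# A sketch of generating Variant binary data with Spark, assuming a
# Spark 4.x build with VARIANT support. parse_json is the Spark SQL
# function that parses a JSON string into a VARIANT value; the output
# path here is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-examples").getOrCreate()

# Round-trip one JSON example into the Variant binary encoding.
df = spark.sql("""
    SELECT parse_json('{"a": 1, "b": ["x", "y"]}') AS var
""")

# Spark writes the variant column as a Parquet group with two binary
# fields, `metadata` and `value` -- those bytes are the artifacts
# this PR checks in.
df.write.mode("overwrite").parquet("/tmp/variant_examples.parquet")
```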

TODO:

  • [ ] Manually verify binary encodings
  • [ ] File follow-on tickets for creating variants from data that is not JSON-encodable (e.g., with timestamp fields)
  • [ ] File a follow-on ticket for creating larger objects / arrays that require different offset lengths

alamb avatar Apr 14 '25 11:04 alamb

I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them), and they look as expected.

If this format is acceptable, I will double-check all the values manually.

alamb avatar Apr 16 '25 14:04 alamb

🦗

alamb avatar Apr 28 '25 15:04 alamb

Today at the Parquet sync, @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you could suggest some other people who might be able to review it.

alamb avatar Apr 30 '25 17:04 alamb

Thank you @emkornfield -- I will address your comments shortly and manually review the binary values

alamb avatar Apr 30 '25 19:04 alamb

I manually reviewed the binary encodings for primitive types and they match VariantEncoding.md as far as I can tell.
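
For anyone checking along, here is a rough sketch (my own, not part of the PR) of how a primitive value header decodes per VariantEncoding.md:

```python
# Per VariantEncoding.md, the basic type sits in the two low bits of the
# first byte of a value, and for primitives (basic_type == 0) the type ID
# sits in the upper six bits.
def decode_primitive_header(first_byte: int) -> tuple[int, int]:
    basic_type = first_byte & 0b11
    type_id = first_byte >> 2
    return basic_type, type_id

# e.g. an int8 (type ID 3) value 34 encodes as b"\x0c\x22":
# header (3 << 2) | 0, then the single value byte
assert decode_primitive_header(0x0C) == (0, 3)
```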

I am actually having trouble manually verifying the nested object metadata; I will continue to investigate.
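
For context, this is the metadata layout I am checking against -- a rough decoding sketch based on my reading of VariantEncoding.md, with illustrative example bytes:

```python
# Rough sketch of decoding Variant metadata per VariantEncoding.md.
# Layout: header byte (version in bits 0-3, sorted flag in bit 4,
# offset_size_minus_one in bits 6-7), then dictionary_size, then
# dictionary_size + 1 offsets, then the concatenated UTF-8 key bytes.
def decode_metadata(buf: bytes) -> list[str]:
    header = buf[0]
    assert header & 0x0F == 1, "only version 1 is defined"
    offset_size = ((header >> 6) & 0x03) + 1  # 1..4 bytes per offset

    def read_int(pos: int) -> int:
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dict_size = read_int(1)
    offsets_start = 1 + offset_size
    offsets = [read_int(offsets_start + i * offset_size)
               for i in range(dict_size + 1)]
    bytes_start = offsets_start + (dict_size + 1) * offset_size
    return [buf[bytes_start + offsets[i]:bytes_start + offsets[i + 1]].decode("utf-8")
            for i in range(dict_size)]

# e.g. metadata for an object with the single key "a":
assert decode_metadata(b"\x01\x01\x00\x01a") == ["a"]
```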

I did verify that using pyspark built from main as of today still generates the same variant binary values

alamb avatar May 02 '25 15:05 alamb

Thank you for the review @RussellSpitzer

alamb avatar May 02 '25 21:05 alamb

LGTM. Thank you @alamb for taking the initiative in driving this forward.

emkornfield avatar May 03 '25 06:05 emkornfield

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}
  2. Null object metadata is empty -- is this expected?

mapleFU avatar May 12 '25 08:05 mapleFU

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}

I tried to follow the naming in the table from VariantEncoding.md, which uses those terms:

| Equivalence Class | Variant Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|---|---|---|---|---|
| Exact Numeric | decimal4 | 8 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | 9 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | 10 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
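
As a worked illustration of the decimal4 row above (my own sketch, not from the spec):

```python
# decimal4 layout from the table above: header byte
# (type ID 8 << 2 | basic_type 0), one scale byte, then a 4-byte
# little-endian unscaled value. DECIMAL 12.34 = unscaled 1234, scale 2.
import struct

def encode_decimal4(unscaled: int, scale: int) -> bytes:
    header = (8 << 2) | 0  # primitive basic type, type ID 8 (decimal4)
    return bytes([header, scale]) + struct.pack("<i", unscaled)

assert encode_decimal4(1234, 2) == b"\x20\x02\xd2\x04\x00\x00"
```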
  2. Null object metadata is empty -- is this expected?

This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:

  • https://github.com/apache/parquet-testing/issues/81

alamb avatar May 12 '25 10:05 alamb