parquet-java

Is big Decimal in Parquet big endian?

Open 4ertus2 opened this issue 7 months ago • 4 comments

Describe the usage question you have. Please include as many useful details as possible.

It's not clear from the description at https://parquet.apache.org/docs/file-format/data-pages/encodings/ how big decimals (precision > 18) are laid out. As far as I understand, they can be encoded as BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY. But there are two ways to place "the bytes contained in the array" from memory: big endian and little endian. That is, they are stored in memory as little-endian int128 values. Could I place the bytes into FIXED_LEN_BYTE_ARRAY as is?

It's also not clear whether some encoding is applied over these decimal bytes. If so, should the data be swapped to the correct endianness before or after the encoding?

Component(s)

No response

4ertus2 avatar Jun 27 '25 14:06 4ertus2

Hi @4ertus2, looking at the code, the Parquet writer does not do anything special with high-precision decimals: the raw bytes of the unscaled integer are written as a BYTE_ARRAY and read back the same way (https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/plain/PlainValuesWriter.java#L52). Since the array is treated as an opaque blob, byte swapping would not be needed. I'll raise a change to add some comments in the code.

ArnavBalyan avatar Aug 18 '25 05:08 ArnavBalyan

For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element).

This is clear per the spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
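As an illustration of that rule, here is a minimal Java sketch that produces a fixed-length, big-endian, two's-complement byte array from a `BigDecimal`. The class `DecimalBytes` and its methods are hypothetical helpers written for this example and are not part of parquet-java; the sketch relies only on `BigInteger.toByteArray()`, which already returns big-endian two's-complement bytes, and assumes the caller picks a length large enough for the column's precision.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a parquet-java API.
final class DecimalBytes {

    // Encodes the unscaled value as big-endian two's complement,
    // sign-extended on the left to exactly `length` bytes.
    static byte[] toFixedLenBigEndian(BigDecimal value, int length) {
        // toByteArray() yields the minimal big-endian two's-complement form.
        byte[] minimal = value.unscaledValue().toByteArray();
        if (minimal.length > length) {
            throw new IllegalArgumentException(
                "unscaled value does not fit in " + length + " bytes");
        }
        byte[] out = new byte[length];
        // Pad leading bytes with 0xFF for negative values, 0x00 otherwise.
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        int offset = length - minimal.length;
        Arrays.fill(out, 0, offset, pad);
        System.arraycopy(minimal, 0, out, offset, minimal.length);
        return out;
    }

    // Decodes big-endian two's-complement bytes back into a BigDecimal
    // using the scale declared for the column.
    static BigDecimal fromBigEndian(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }
}
```

For example, `DecimalBytes.toFixedLenBigEndian(new BigDecimal("-1.23"), 16)` gives fifteen `0xFF` bytes followed by `0x85`, because the unscaled value is -123 and the most significant byte comes first.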

wgtmac avatar Aug 24 '25 08:08 wgtmac

> This is clear per the spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I didn't find it there. Is this Markdown file the source of truth for the spec? In any case, it's not in sync with https://parquet.apache.org/docs/file-format/types/logicaltypes/

The question is mainly about the docs. I've already found that it needs a bswap128 in the code.
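For readers on the JVM, the equivalent of a 128-bit byte swap can be sketched as a plain array reversal. This is only an illustration under the assumption that the in-memory value is a 16-byte little-endian two's-complement integer; `Int128Endian` is a hypothetical name, not something provided by parquet-java.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration only; not a parquet-java API.
final class Int128Endian {

    // Returns a reversed copy of the input, turning little-endian bytes
    // into big-endian bytes (or the other way round).
    static byte[] reversed(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return out;
    }

    // Interprets 16 little-endian two's-complement bytes as an integer.
    static BigInteger fromLittleEndian(byte[] littleEndian) {
        // BigInteger(byte[]) expects big-endian two's complement.
        return new BigInteger(reversed(littleEndian));
    }
}
```

The reversed array is what would go into the FIXED_LEN_BYTE_ARRAY value. Decoding it with `new BigInteger(bytes)` is a quick way to check that the sign and magnitude survived the swap.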

4ertus2 avatar Aug 24 '25 23:08 4ertus2

Yes, those Markdown files are the source of truth for the spec. The site is unfortunately out of sync, and we have discussed removing the spec from the site and linking to the Markdown files instead.

wgtmac avatar Aug 25 '25 02:08 wgtmac