Is big Decimal in Parquet big endian?
It's not clear from the description at https://parquet.apache.org/docs/file-format/data-pages/encodings/ how big Decimals (precision >= 18) are placed. As I understand, they can be encoded as BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY. But there are two ways to place "the bytes contained in the array" from memory: big endian and little endian. For example, an int128 is laid out little-endian in memory. Can I place the bytes into FIXED_LEN_BYTE_ARRAY as-is?
It's also not clear whether, if some encoding is applied over the decimal's bytes, the data should be swapped to the correct endianness before or after the encoding.
Hi @4ertus2, looking at the code, the Parquet writer does not do anything special with high-precision decimals: the raw bytes of the unscaled integer are written through BYTE_ARRAY and read back the same way (https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/plain/PlainValuesWriter.java#L52). Since the array is treated as an opaque blob, byte swapping would not be needed by the writer itself. I'll raise a change to add some comments in the code.
For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element)
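The rule quoted above is exactly what Java's `BigInteger.toByteArray()` produces (big-endian two's complement), so a writer only needs to sign-extend it to the column's fixed length. A minimal sketch of such a helper (`toFixedLenBytes` is a hypothetical name, not a parquet-java API):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class DecimalEncoding {
    // Hypothetical helper: encode an unscaled decimal as two's-complement,
    // big-endian bytes of exactly `typeLength` bytes, as the spec requires
    // for a FIXED_LEN_BYTE_ARRAY column.
    static byte[] toFixedLenBytes(BigInteger unscaled, int typeLength) {
        byte[] be = unscaled.toByteArray(); // big-endian two's complement
        if (be.length > typeLength) {
            throw new IllegalArgumentException("value does not fit in " + typeLength + " bytes");
        }
        byte[] out = new byte[typeLength];
        // Sign-extend: pad with 0xFF for negative values, 0x00 otherwise.
        byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, typeLength - be.length, pad);
        System.arraycopy(be, 0, out, typeLength - be.length, be.length);
        return out;
    }

    public static void main(String[] args) {
        // Decimal 1.00 stored as unscaled 100 (scale 2) in a 16-byte field:
        byte[] b = toFixedLenBytes(BigInteger.valueOf(100), 16);
        System.out.println(b[15]); // prints 100: least significant byte is last
    }
}
```

Note the most significant byte ends up at index 0, matching "the most significant byte is the zeroth element" from the spec.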
This is clear per the spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
I didn't find it there. Is this markdown file the source of truth for the spec? In any case, it's not in sync with https://parquet.apache.org/docs/file-format/types/logicaltypes/
The question is mainly about the docs. I've already found that the code needs a bswap128.
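For anyone landing here: the bswap mentioned above must happen before the value is handed to the writer, since every Parquet encoding treats the byte array as opaque. A sketch of the swap in Java (the helper name `toBigEndian` is illustrative; in C/C++ this is the bswap128 the reporter refers to):

```java
public class DecimalByteSwap {
    // Hypothetical helper: reverse the raw bytes of a little-endian
    // int128 into the big-endian order the Parquet spec expects for
    // decimal BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY payloads.
    static byte[] toBigEndian(byte[] littleEndian) {
        byte[] be = new byte[littleEndian.length];
        for (int i = 0; i < littleEndian.length; i++) {
            be[i] = littleEndian[littleEndian.length - 1 - i];
        }
        return be;
    }

    public static void main(String[] args) {
        // int128 value 1 in little-endian memory: byte 0 holds 0x01.
        byte[] le = new byte[16];
        le[0] = 0x01;
        byte[] be = toBigEndian(le);
        System.out.println(be[15]); // prints 1: least significant byte is now last
    }
}
```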
Yes, those markdown files are the source of truth for the spec. The site is unfortunately out of sync, and we had a discussion about removing the spec from the site and linking to the markdown files instead.