orc icon indicating copy to clipboard operation
orc copied to clipboard

Java impl does not fully adhere to the usage contract of `ColumnVector.noNulls` and `ColumnVector.isNull`

Open guiyanakuang opened this issue 2 years ago • 4 comments

The comments in ColumnVector indicate that we should only use the isNull array to determine null values when noNulls is set to false. This helps us reuse ColumnVector, for example, during reuse when noNulls transitions from true to true or false to true, we don't need to set isNull, which can help improve performance.

Unfortunately, there are some counterexamples in the Java impl, such as in BitFieldReader.java, where we directly read isNull without checking noNulls first. To ensure correctness, Java resets noNulls and isNull every time, as seen in TreeReaderFactory.java.

This issue originates from the discussion of ORC-1408, whether the ORC Java version needs to modify all counterexamples to adhere to the contract. I'm inclined to make the changes to be consistent with the C++ impl and to avoid unnecessary isNull setting, although I haven't verified how much performance improvement it could bring.

guiyanakuang avatar Apr 15 '23 06:04 guiyanakuang

cc @williamhyun @dongjoon-hyun @wgtmac

guiyanakuang avatar Apr 15 '23 06:04 guiyanakuang

IMHO, it is safe to make the change if the writer and reader can promise the contract to the downstream consumers (e.g. Apache Spark and Apache Hive). The ORC java library can break the promise in the internal implementation.

wgtmac avatar Apr 15 '23 13:04 wgtmac

+1 for the proposal.

It seems that there is no compatibility issue, just a performance optimization and a more standardized use of Hive classes.

deshanxiao avatar Apr 17 '23 03:04 deshanxiao

It would be great to standardize this interaction across the project with respect to the contract. Agree with the proposal that as long as the contract with external consumers is not affected we can work on internal standardization/consistency.

Regards, Pavan

On Apr 15, 2023, at 6:07 AM, Gang Wu @.***> wrote:

IMHO, it is safe to make the change if the writer and reader can promise the contract to the downstream consumers (e.g. Apache Spark and Apache Hive). The ORC java library can break the promise in the internal implementation.

— Reply to this email directly, view it on GitHub https://github.com/apache/orc/issues/1470#issuecomment-1509817969, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABYDHO3E55QXFBW5YYD2DFTXBKMSFANCNFSM6AAAAAAW7G4FZM. You are receiving this because you are subscribed to this thread.

pavibhai avatar Apr 17 '23 15:04 pavibhai