RecordReaderImpl.getValueRange() may cause incorrect results
ORC version: 1.6.11; SQL: `select xxx from xxx where str is not null`
Recently I found that some ORC files written by Trino don't have complete statistics in the file metadata (possibly a Presto bug). Because of this, OrcProto.ColumnStatistics can't be deserialized into a specific ColumnStatisticsImpl such as StringStatisticsImpl; RecordReaderImpl.getValueRange() then returns a ValueRange with a null lower bound, and RecordReaderImpl.pickRowGroups() skips the row group even though it should not be skipped. Apart from this case, everything works fine. I also found that orc-1.5.x handles this case via RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
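For illustration, here is a minimal sketch of the fallback that the old UNKNOWN_VALUE sentinel effectively provided (hypothetical names, not the actual RecordReaderImpl code): a range with a missing bound makes the predicate inconclusive, so the row group has to be kept.

```java
// Minimal sketch, not ORC internals: a row group whose statistics lack
// usable bounds must be treated as a possible match, never skipped.
final class RowGroupPruningSketch {
  /** Returns true if the row group may contain rows matching {@code value}. */
  static <T extends Comparable<T>> boolean mayMatch(T lower, T upper, T value) {
    if (lower == null || upper == null) {
      // Incomplete statistics (e.g. a writer that omitted min/max): we
      // cannot prove the predicate false, so the row group must be kept.
      return true;
    }
    // With real bounds, prune row groups whose range excludes the value.
    return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
  }
}
```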
Thank you for reporting, @PengleiShi.
- Ya, I've heard that there exist ORC writers that don't generate statistics properly.
- Could you provide some sample ORC files?
AFAIK, this doesn't happen between Apache ORC writer and reader, right? @PengleiShi
Yes, it doesn't. In the case I mentioned, the files were written by Trino (which has its own ORC writer) and read by Spark (which depends on the Apache ORC reader).
> Could you provide some sample ORC files?
Most files written by Trino have proper statistics. I will try to re-generate some problematic ORC files that can be shared publicly.
The metadata of the problematic files is shown below:

@dongjoon-hyun Trino won't write string column statistics if a string value is larger than 64 bytes.
Here is an ORC file written by Trino that contains only one row.
Tested with Spark 3.2:
`select * from xxx;`
`select * from xxx where name is not null;`
20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip
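If it helps, the pruning can also be reproduced with the Java reader directly. Here is a hypothetical sketch (the `name` column and the predicate come from the queries above; the file name and the exact SearchArgument wiring are my assumptions and may need adjustment):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class ReproSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("problem.orc"),
        OrcFile.readerOptions(conf));
    // Push-down equivalent of `where name is not null`.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startNot().isNull("name", PredicateLeaf.Type.STRING).end()
        .build();
    RecordReader rows = reader.rows(new Reader.Options(conf)
        .searchArgument(sarg, new String[]{"name"}));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    long count = 0;
    while (rows.nextBatch(batch)) {
      count += batch.size;
    }
    rows.close();
    // Expected: 1 row. On 1.6.11 this prints 0 for the attached file,
    // because the row group without min/max statistics gets skipped.
    System.out.println("rows read: " + count);
  }
}
```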
@PengleiShi do you mind sharing the stats of the problematic case above? We currently trim StringStatistics to 1024 chars in the default writer (see https://issues.apache.org/jira/browse/ORC-203); I believe Presto should follow similar logic. In addition, on the reader path we probably want to avoid skipping row groups when facing problematic/null ValueRange stats.
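To make the trimming idea concrete, here is a rough sketch of bounded string statistics (a hypothetical helper, not the actual ORC writer code from ORC-203):

```java
// A prefix of the minimum is still a valid lower bound; a truncated
// maximum must be bumped so it still compares >= the original value.
final class StringStatsBounds {
  static final int MAX_STAT_LENGTH = 1024; // the default cap per ORC-203

  static String boundedMin(String min) {
    return min.length() <= MAX_STAT_LENGTH
        ? min : min.substring(0, MAX_STAT_LENGTH);
  }

  static String boundedMax(String max) {
    if (max.length() <= MAX_STAT_LENGTH) {
      return max;
    }
    // Truncate, then increment the final character; a real implementation
    // must also handle the carry when that character is already maximal.
    char[] prefix = max.substring(0, MAX_STAT_LENGTH).toCharArray();
    prefix[prefix.length - 1]++;
    return new String(prefix);
  }
}
```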
@pgaref I have uploaded a problematic file above for testing. Its metadata is shown below:
