RecordReaderImpl.getValueRange() may cause incorrect results
ORC version: 1.6.11; SQL: `select xxx from xxx where str is not null`
Recently I found that some ORC files written by Trino don't have complete statistics in the file metadata (possibly a Presto bug). Because of this, OrcProto.ColumnStatistics can't be deserialized into a specific ColumnStatisticsImpl such as StringStatisticsImpl; RecordReaderImpl.getValueRange() then returns a ValueRange with a null lower bound, and RecordReaderImpl.pickRowGroups() skips the row group even though it should not be skipped. Apart from this case, everything works fine. I also found that orc-1.5.x handles this case via RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
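For illustration, here is a minimal sketch of the fallback that the old UNKNOWN_VALUE sentinel effectively provided (hypothetical names, not the actual RecordReaderImpl code): a range with a missing bound makes the predicate inconclusive, so the row group has to be kept.

```java
// Minimal sketch, not ORC internals: a row group whose statistics lack
// usable bounds must be treated as a possible match, never skipped.
final class RowGroupPruningSketch {
  /** Returns true if the row group may contain rows matching {@code value}. */
  static <T extends Comparable<T>> boolean mayMatch(T lower, T upper, T value) {
    if (lower == null || upper == null) {
      // Incomplete statistics (e.g. a writer that omitted min/max): we
      // cannot prove the predicate false, so the row group must be kept.
      return true;
    }
    // With real bounds, prune row groups whose range excludes the value.
    return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
  }
}
```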
Thank you for reporting, @PengleiShi.
- Ya, I've heard that there exist ORC writers that don't generate statistics properly.
- Could you provide some sample ORC files?
AFAIK, this doesn't happen between Apache ORC writer and reader, right? @PengleiShi
Yes, it doesn't. In the case I mentioned, the files were written by Trino (which has its own ORC writer) and read by Spark (which depends on the Apache ORC reader).
> Could you provide some sample ORC files?
Most files written by Trino have proper statistics. I will try to re-generate some problematic ORC files that can be shared publicly.
The metadata of the problematic files is shown below:

@dongjoon-hyun Trino won't write string column statistics if a string value is larger than 64 bytes.
Here is an ORC file written by Trino that contains only one row.
Tested with Spark 3.2:
`select * from xxx;`
`select * from xxx where name is not null;`
20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip
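If it helps, the pruning can also be reproduced with the Java reader directly. Here is a hypothetical sketch (the `name` column and the predicate come from the queries above; the file name and the exact SearchArgument wiring are my assumptions and may need adjustment):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class ReproSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("problem.orc"),
        OrcFile.readerOptions(conf));
    // Push-down equivalent of `where name is not null`.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startNot().isNull("name", PredicateLeaf.Type.STRING).end()
        .build();
    RecordReader rows = reader.rows(new Reader.Options(conf)
        .searchArgument(sarg, new String[]{"name"}));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    long count = 0;
    while (rows.nextBatch(batch)) {
      count += batch.size;
    }
    rows.close();
    // Expected: 1 row. On 1.6.11 this prints 0 for the attached file,
    // because the row group without min/max statistics gets skipped.
    System.out.println("rows read: " + count);
  }
}
```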
@PengleiShi do you mind sharing the stats of the problematic case above? We currently trim StringStatistics to 1024 chars in the default writer (see https://issues.apache.org/jira/browse/ORC-203); I believe Presto should follow similar logic. In addition, on the reader path we probably want to avoid skipping row groups when facing problematic/null ValueRange stats.
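To make the trimming idea concrete, here is a rough sketch of bounded string statistics (a hypothetical helper, not the actual ORC writer code from ORC-203):

```java
// A prefix of the minimum is still a valid lower bound; a truncated
// maximum must be bumped so it still compares >= the original value.
final class StringStatsBounds {
  static final int MAX_STAT_LENGTH = 1024; // the default cap per ORC-203

  static String boundedMin(String min) {
    return min.length() <= MAX_STAT_LENGTH
        ? min : min.substring(0, MAX_STAT_LENGTH);
  }

  static String boundedMax(String max) {
    if (max.length() <= MAX_STAT_LENGTH) {
      return max;
    }
    // Truncate, then increment the final character; a real implementation
    // must also handle the carry when that character is already maximal.
    char[] prefix = max.substring(0, MAX_STAT_LENGTH).toCharArray();
    prefix[prefix.length - 1]++;
    return new String(prefix);
  }
}
```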
@pgaref I have uploaded a problematic file above for testing. Its metadata is shown below:
