
[C++] Unable to filter DECIMAL column from ORC file

Open karan-k-deepr opened this issue 4 years ago • 4 comments

This question is similar to THIS one I asked earlier on StackOverflow, which I got working after some more trials.
Previously the problem was with the column ID. Now I am trying to filter a column of DECIMAL data type, but the results always contain all the data instead of only the filtered rows.

Data in the relevant columns of the ORC file (screenshot not preserved):

And this is how I am trying to filter out the DECIMAL column using orc::SearchArgument:

orc::RowReaderOptions m_RowReaderOpts;
orc::ReaderOptions m_ReaderOpts;

std::unique_ptr<orc::Reader> m_Reader;
std::unique_ptr<orc::RowReader> m_RowReader;

auto builder = orc::SearchArgumentFactory::newBuilder();
const int snapshot_time_col_id = 22;

orc::Literal ss_begin_time{34080000000000, 14, 9};
orc::Literal ss_end_time{34380000000000, 14, 9};

// I have also tried the following, but it didn't work either.
// orc::Literal ss_begin_time{34080, 5, 0};
// orc::Literal ss_end_time{34380, 5, 0};

builder->between(snapshot_time_col_id, orc::PredicateDataType::DECIMAL, ss_begin_time, ss_end_time);

m_RowReaderOpts.searchArgument(builder->build());
// Assign to the members declared above (m_Reader / m_RowReader).
m_Reader = orc::createReader(orc::readFile(a_FilePath.c_str()), m_ReaderOpts);
m_RowReader = m_Reader->createRowReader(m_RowReaderOpts);

Could you suggest how to filter data of type DECIMAL?

karan-k-deepr avatar Jan 20 '22 12:01 karan-k-deepr

cc @wgtmac and @stiga-huang

dongjoon-hyun avatar Jan 21 '22 02:01 dongjoon-hyun

Any update on this bug?

karan-k-deepr avatar Apr 12 '22 18:04 karan-k-deepr

Could you verify if the whole batch returned by row_reader->next() violates the SearchArgument? If so, there are bugs. Otherwise, it's by design.

orc::SearchArgument is used as an indicator for the reader to skip unrelated RowGroups, i.e. it's only evaluated on RowGroup level (not row-level). If the reader can't filter out a RowGroup based on the SearchArgument, it will return all rows of that RowGroup. The caller is expected to filter out rows by itself.

stiga-huang avatar Apr 13 '22 01:04 stiga-huang

@stiga-huang I checked the min and max values of the batches returned by row_reader->next(). The returned batches are not filtered at all for decimal values.

karan-k-deepr avatar Apr 25 '22 13:04 karan-k-deepr
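
As noted earlier in the thread, orc::SearchArgument only prunes whole RowGroups, so the caller still has to filter individual rows. Below is a minimal, self-contained sketch of that caller-side step. It does not use the ORC library: DecimalColumn, rescale and rowsBetween are hypothetical names made up for illustration. The struct only mirrors the general shape of a decoded DECIMAL batch (unscaled 64-bit values, a null mask, and a scale); mapping it onto the real batch type returned by row_reader->next() is an assumption you would need to check against your ORC version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one decoded DECIMAL column batch (not an ORC type).
struct DecimalColumn {
  std::vector<int64_t> values;  // unscaled values, e.g. 34080000000000 at scale 9
  std::vector<char> notNull;    // 1 = value present, 0 = null
  int32_t scale;                // digits after the decimal point
};

// Bring an unscaled value from one scale to a larger (or equal) scale,
// e.g. 34080 at scale 0 becomes 34080000000000 at scale 9.
// Assumes the result fits in int64_t.
int64_t rescale(int64_t unscaled, int32_t fromScale, int32_t toScale) {
  assert(toScale >= fromScale);
  for (int32_t i = fromScale; i < toScale; ++i) {
    unscaled *= 10;
  }
  return unscaled;
}

// Return the indices of non-null rows whose value lies in [lo, hi].
// lo and hi must already be expressed at the column's scale.
std::vector<std::size_t> rowsBetween(const DecimalColumn& col,
                                     int64_t lo, int64_t hi) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < col.values.size(); ++i) {
    if (col.notNull[i] && col.values[i] >= lo && col.values[i] <= hi) {
      selected.push_back(i);
    }
  }
  return selected;
}
```

In this sketch the bounds are compared as unscaled integers at the column's scale, which is why a bound such as 34080 (scale 0) has to be rescaled to scale 9 before it is comparable with 34080000000000. Rows that the reader returns from a RowGroup it could not skip would then be dropped by rowsBetween.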