feat(core): Add support for `_file` column
Which issue does this PR close?
- Closes #1766.
What changes are included in this PR?
Integrates virtual field handling for the `_file` metadata column into `RecordBatchTransformer` using a pre-computed constants map, eliminating post-processing and duplicate lookups.
Key Changes
New `metadata_columns.rs` module: centralized utilities for metadata columns (sketched below)
- Constants: `RESERVED_FIELD_ID_FILE`, `RESERVED_COL_NAME_FILE`
- Helper functions: `get_metadata_column_name()`, `get_metadata_field_id()`, `is_metadata_field()`, `is_metadata_column_name()`
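To make the helpers concrete, here is a minimal sketch of what such a module boils down to: a table of reserved IDs and names plus lookups in both directions. The names follow the PR description, but the reserved ID value and the exact signatures are assumptions for illustration, not copied from the PR.

```rust
// Hypothetical sketch of the metadata_columns.rs helpers. Names follow the PR
// description; the reserved ID value and exact signatures are assumptions.

/// Reserved field ID for the `_file` metadata column (Iceberg reserves IDs
/// near i32::MAX for metadata columns).
pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
/// Reserved column name for the file-path metadata column.
pub const RESERVED_COL_NAME_FILE: &str = "_file";

/// Maps a reserved metadata field ID to its column name, if it is one.
pub fn get_metadata_column_name(field_id: i32) -> Option<&'static str> {
    match field_id {
        RESERVED_FIELD_ID_FILE => Some(RESERVED_COL_NAME_FILE),
        _ => None,
    }
}

/// Maps a metadata column name to its reserved field ID, if it is one.
pub fn get_metadata_field_id(name: &str) -> Option<i32> {
    match name {
        RESERVED_COL_NAME_FILE => Some(RESERVED_FIELD_ID_FILE),
        _ => None,
    }
}

/// True if the field ID belongs to a reserved metadata column.
pub fn is_metadata_field(field_id: i32) -> bool {
    get_metadata_column_name(field_id).is_some()
}

/// True if the column name refers to a reserved metadata column.
pub fn is_metadata_column_name(name: &str) -> bool {
    get_metadata_field_id(name).is_some()
}
```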
Enhanced `RecordBatchTransformer` (see the sketch after this list):
- Added `constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)>`, pre-computed during initialization
- New `with_constant()` method, which computes the Arrow type once during setup
- Updated to use pre-computed types and values (avoids duplicate lookups)
- Handles `DataType::RunEndEncoded` for constant strings (memory efficient)
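The following self-contained sketch illustrates the "compute the Arrow type once, reuse it per batch" idea together with a run-end encoded constant column. It is not the PR's code: the real transformer stores a `PrimitiveLiteral`, lives inside `RecordBatchTransformer`, and covers more types; the `ConstantFields` stand-in below is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{ArrayRef, Int32Array, RunArray, StringArray};
use arrow_schema::{ArrowError, DataType, Field};

/// Stand-in for the transformer's constant-field bookkeeping. The real
/// `RecordBatchTransformer` keeps `(DataType, PrimitiveLiteral)`; a `String`
/// is used here so the sketch stays self-contained.
#[derive(Default)]
struct ConstantFields {
    constants: HashMap<i32, (DataType, String)>,
}

impl ConstantFields {
    /// Registers a constant string column by field ID, computing the Arrow
    /// type once up front so per-batch processing can skip the lookup.
    fn with_constant(mut self, field_id: i32, value: String) -> Self {
        let run_ends = Arc::new(Field::new("run_ends", DataType::Int32, false));
        let values = Arc::new(Field::new("values", DataType::Utf8, true));
        self.constants
            .insert(field_id, (DataType::RunEndEncoded(run_ends, values), value));
        self
    }

    /// Materializes one constant column as a run-end encoded array: a single
    /// run covering all `num_rows` rows, so the value is stored only once.
    fn constant_column(&self, field_id: i32, num_rows: usize) -> Result<ArrayRef, ArrowError> {
        let (_, value) = self.constants.get(&field_id).expect("constant registered");
        let run_ends = Int32Array::from(vec![num_rows as i32]);
        let values = StringArray::from(vec![value.as_str()]);
        Ok(Arc::new(RunArray::<Int32Type>::try_new(&run_ends, &values)?))
    }
}
```

Because the run-end encoded column stores the constant value once per batch (plus a single run end), the `_file` path is not duplicated for every row, which is where the memory saving comes from.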
Simplified reader.rs (usage sketch below):
- Pass the full `project_field_ids` (including virtual fields) to `RecordBatchTransformer`
- Single `with_constant()` call to register the `_file` column
- Removed the post-processing loop
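Continuing the hypothetical `ConstantFields` sketch above, the reader-side wiring then reduces to one registration per data-file task; the function name and shape below are illustrative only, not the actual reader.rs code.

```rust
// Reusing the hypothetical ConstantFields and RESERVED_FIELD_ID_FILE from the
// sketches above: one registration per scan task replaces the old
// post-processing loop over emitted batches.
fn constants_for_task(data_file_path: &str) -> ConstantFields {
    ConstantFields::default().with_constant(RESERVED_FIELD_ID_FILE, data_file_path.to_string())
}
```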
Updated scan/mod.rs (sketch below):
- Use `is_metadata_column_name()` and `get_metadata_field_id()` instead of hardcoded checks
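In the same spirit, the scan-side checks can lean on the helpers instead of comparing against `"_file"` directly; the functions below are a hypothetical shape, not the actual scan/mod.rs code.

```rust
// Hypothetical projection handling built on the metadata_columns sketch above:
// a selected column is either found in the table schema or recognized as a
// reserved metadata column such as `_file`.
fn resolve_selected_field_id(schema_field_id: Option<i32>, column_name: &str) -> Option<i32> {
    schema_field_id.or_else(|| get_metadata_field_id(column_name))
}

fn is_selectable(column_name: &str, in_schema: bool) -> bool {
    in_schema || is_metadata_column_name(column_name)
}
```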
Are these changes tested?
Yes, comprehensive tests have been added to verify the functionality:
New Tests (7 tests added)
Table Scan API Tests (7 tests)
- `test_select_with_file_column` - Verifies basic functionality of selecting `_file` with regular columns
- `test_select_file_column_position` - Verifies column ordering is preserved
- `test_select_file_column_only` - Tests selecting only the `_file` column
- `test_file_column_with_multiple_files` - Tests the multiple data files scenario
- `test_file_column_at_start` - Tests `_file` at position 0
- `test_file_column_at_end` - Tests `_file` at the last position
- `test_select_with_repeated_column_names` - Tests repeated column selection
Thanks @gbrgr for this PR. But I think we need to rethink how to compute the `_file` and `_pos` metadata columns. While it's somewhat trivial to compute `_file`, it's non-trivial to compute `_pos` efficiently, since when we read parquet files we have already filtered out some row groups. I think the best way is to push reading these two columns down to arrow-rs.
@liurenjie1024 I agree for `_pos`, and we have a PR there: https://github.com/apache/arrow-rs/pull/8715
But `_file` seems like something that arrow-rs doesn't need to know about. Similarly, in the future, for `_row_id` from the V3 spec, we cannot expect arrow-rs to be responsible for computing it.
How do we go forward with rethinking this, and what would be the action items for us?
Hi @vustef, I also agree that we should put `_file` in iceberg-rust, and I left some comments about how to proceed.
@liurenjie1024 I now resolved the merge conflicts that stem from PR #1821:
- I removed the partition information stored in the `RecordBatchTransformer`, and instead generically store constant fields in it, which can be used to add column sources.
- I changed all added columns to be REE encoded, as they are all constant values.
@vustef I changed the PR so that default values are no longer REE encoded; only constant fields that come from added metadata fields and partition data are.
@liurenjie1024 let us know whether that is OK. If REE is desired in general for all constant columns, I guess it is better to make a follow-up PR to keep changesets smaller.
@liurenjie1024 just a friendly ping on this for the new round of your feedback, if you have time.
Hi @gbrgr, are you still working on this? This is a blocker for the 0.8.0 release.
@liurenjie1024 yes I will incorporate the feedback today