iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

feat(core): Add support for `_file` column

Open gbrgr opened this issue 2 months ago • 3 comments

Which issue does this PR close?

  • Closes #1766.

What changes are included in this PR?

Integrates virtual field handling for the _file metadata column into RecordBatchTransformer using a pre-computed constants map, eliminating post-processing and duplicate lookups.

Key Changes

New metadata_columns.rs module: Centralized utilities for metadata columns

  • Constants: RESERVED_FIELD_ID_FILE, RESERVED_COL_NAME_FILE
  • Helper functions: get_metadata_column_name(), get_metadata_field_id(), is_metadata_field(), is_metadata_column_name()

Enhanced RecordBatchTransformer:

  • Added constant_fields: HashMap<i32, (DataType, PrimitiveLiteral)> - pre-computed during initialization
  • New with_constant() method - computes Arrow type once during setup
  • Updated to use pre-computed types and values (avoids duplicate lookups)
  • Handles DataType::RunEndEncoded for constant strings (memory efficient)

Simplified reader.rs:

  • Pass full project_field_ids (including virtual) to RecordBatchTransformer
  • Single with_constant() call to register _file column
  • Removed post-processing loop

Updated scan/mod.rs:

  • Use is_metadata_column_name() and get_metadata_field_id() instead of hardcoded checks

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality:

New Tests (7 tests added)

Table Scan API Tests (7 tests)

  1. test_select_with_file_column - Verifies basic functionality of selecting _file with regular columns
  2. test_select_file_column_position - Verifies column ordering is preserved
  3. test_select_file_column_only - Tests selecting only the _file column
  4. test_file_column_with_multiple_files - Tests multiple data files scenario
  5. test_file_column_at_start - Tests _file at position 0
  6. test_file_column_at_end - Tests _file at the last position
  7. test_select_with_repeated_column_names - Tests repeated column selection

gbrgr avatar Nov 04 '25 09:11 gbrgr

Thanks @gbrgr for this pr. But I think we need to rethink how to compute the _file, _pos metadata column. While it's somehow trivial to compute _file, it's non trivial to compute _pos efficient, since when we read parquet files, we have filtered out some row groups. I think the best way is to push reading these two columns to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR there: https://github.com/apache/arrow-rs/pull/8715 But _file seems like something that we don't need the arrow-rs to know about. Similarly, in future, for _row_id from V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

How do we go forward with rethinking this, what would be the action items for us?

vustef avatar Nov 06 '25 10:11 vustef

Thanks @gbrgr for this pr. But I think we need to rethink how to compute the _file, _pos metadata column. While it's somehow trivial to compute _file, it's non trivial to compute _pos efficient, since when we read parquet files, we have filtered out some row groups. I think the best way is to push reading these two columns to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR there: apache/arrow-rs#8715 But _file seems like something that we don't need the arrow-rs to know about. Similarly, in future, for _row_id from V3 spec, we cannot expect arrow-rs to be responsible for computing that one.

How do we go forward with rethinking this, what would be the action items for us?

Hi, @vustef I also agree that we should put _file in iceberg-rust, and I left some comments about how to proceed.

liurenjie1024 avatar Nov 10 '25 01:11 liurenjie1024

@liurenjie1024 I now resolved the merge conflicts that stem from PR #1821:

  • I removed partition information stored in the RecordBatchTransformer, and instead generically store constant fields in it which can be used to add column sources.
  • I changed all added columns to be REE encoded, as they are all constant values.

gbrgr avatar Nov 14 '25 08:11 gbrgr

@vustef I changed the PR in the sense that default values are not REE encoded anymore, but only constant fields that come from added metadata fields + partition data.

@liurenjie1024 let us know whether that is OK. If REE is desired in general for all constant columns, I guess it is better to make a follow-up PR to keep changesets smaller.

gbrgr avatar Nov 17 '25 12:11 gbrgr

@liurenjie1024 just a friendly ping on this for the new round of your feedback, if you have time.

vustef avatar Nov 24 '25 13:11 vustef

Hi, @gbrgr are you stilling working on this? This is a blocker of 0.8.0 release.

liurenjie1024 avatar Dec 09 '25 03:12 liurenjie1024

@liurenjie1024 yes I will incorporate the feedback today

gbrgr avatar Dec 09 '25 05:12 gbrgr