Core: Add readable metrics columns to files metadata tables
Closes #4362
This adds the following columns to all files tables:
- column_sizes_metrics
- value_counts_metrics
- null_value_counts_metrics
- nan_value_counts_metrics
- lower_bounds_metrics
- upper_bounds_metrics
These are added as new columns to keep backward compatibility, as the existing metrics columns cannot be changed.
The first four return Map<String, Long>, where the key is the human-readable column name (dot-separated for nested columns). The last two return Map<String, String>, where the key is the same and the value is the human-readable lower/upper bound.
Example:
- value_counts_metrics = Map("mystruct.timestamp" => 1000)
- upper_bounds_metrics = Map("mystruct.timestamp" => "1970-01-01T00:00:00.000002")
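For instance, a query against the new columns might look like the following sketch (db.t and the mystruct.timestamp column are hypothetical names):

```sql
-- A sketch of querying the new human-readable metrics columns;
-- db.t and mystruct.timestamp are hypothetical
SELECT
  file_path,
  value_counts_metrics['mystruct.timestamp'] AS ts_value_count,
  upper_bounds_metrics['mystruct.timestamp'] AS ts_upper_bound
FROM db.t.files;
```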
This brings Iceberg metadata tables a bit closer to Trino, where the last two columns are Map<Long, String> (column id to human-readable bound). This change goes further and resolves the column id to a readable name.
Implementation detail: now that we add new columns to the files table, its rows are no longer a 1-to-1 mapping to the "DataFile" Java object getters, so we have to add code to handle column mapping in the projection case; see the sketch below.
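To illustrate, a projection like the one below mixes DataFile-backed columns with the derived metrics columns, so projected positions no longer line up with DataFile getters (the table name is hypothetical):

```sql
-- file_path and record_count map directly to DataFile getters,
-- while value_counts_metrics is derived; the reader must map
-- projected positions back to the right source explicitly
SELECT file_path, record_count, value_counts_metrics
FROM db.t.files;
```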
All Spark tests are updated/fixed now.
FYI @RussellSpitzer @aokolnychyi @rdblue, if you have time to leave some feedback. New tests were added, but the file is a bit too big for the 'Files Changed' view: TestMetadataTableMetricsColumns.java
It seems like a great idea to add readable metrics. It is hard to make sense of them otherwise.
@szehon-ho, what do you think about adding a single map column, let's say called readable_metrics, that would hold a mapping from a column name to a struct representing the metrics? The type would be Map<String, StructType>, and we would have individual struct fields for each type of metric.
We can then easily access them via SQL.
SELECT readable_metrics['col1'].lower_bound FROM db.t.files
I am okay with individual columns too but it seems a bit cleaner to just have one.
Let me check in a bit.
Let me take a look today.
Let me take a look now.
Added an additional test; it looks like this works even when the readable_metrics column is selected before other columns (Spark somehow requests the rows in their original order).
Really nice PR, thanks @szehon-ho and @aokolnychyi for the effort! When can we merge this? I think it is ready, and it has been two months since the last review; leaving it longer will lead to more conflicts.
Updated and rebased. @RussellSpitzer, please take a look as well if you have time.
Update: chatted offline with @RussellSpitzer; will spend a few days to see if it is possible to make the type a dynamic struct instead of a static map, to get the right types for the lower/upper bounds.
@RussellSpitzer I think it is now as we discussed: readable_metrics is a dynamically-typed struct of column metrics for all primitive columns, keyed by qualified name. Each column's metrics is a struct, in which the upper/lower bounds have the original column type.
Added code to generate the dynamic schema and handle projection (previously it was a map, so no projection handling was needed).
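A sketch of what access could look like with the dynamic struct, assuming a primitive long column id and a nested timestamp column mystruct.timestamp (dotted names need backtick quoting in Spark SQL):

```sql
-- With the dynamic struct, bounds come back in the column's own type
SELECT
  readable_metrics.id.lower_bound,                   -- BIGINT, not a string
  readable_metrics.`mystruct.timestamp`.upper_bound  -- TIMESTAMP, not a string
FROM db.t.files;
```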
Transient error downloading; restarting.
@RussellSpitzer addressed the comments, thanks!
Actually, hold on a second; looking at a small refactor to make it more generic, so that new readable_metrics definitions can be added in the future.
@RussellSpitzer should be good now for another look when you get a chance, thanks!
Yep, the test should be here: https://github.com/apache/iceberg/blob/6681dba9bc7dc0d793aa8de739d2b9962260b0ff/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestMetadataTableReadableMetrics.java
Would love to find a good way to simplify it without weakening the checks. Currently it compares every single field.
Thanks @RussellSpitzer @aokolnychyi @chenjunjiedada for detailed reviews
@szehon-ho @RussellSpitzer Is there any documentation about these readable metrics? Are all these metrics exposed through the files metadata tables only?
Closes #4362
This adds the following column to all files tables:
- readable_metrics, which is a struct of:
  - column_sizes
  - value_counts
  - null_value_counts
  - nan_value_counts
  - lower_bounds
  - upper_bounds

Each of these is then a map of column_name to value.
@szehon-ho The actual column names are without the 's' at the end.
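Putting that correction together with the dynamic struct described earlier, a usage sketch (id is a hypothetical primitive column) could be:

```sql
-- Singular metric field names, per the correction above
SELECT file_path, readable_metrics.id.null_value_count
FROM db.t.files
WHERE readable_metrics.id.null_value_count > 0;
```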