iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Core: Add readable metrics columns to files metadata tables

Open szehon-ho opened this issue 3 years ago • 2 comments

Closes #4362

This adds following columns to all files tables:

  • column_sizes_metrics
  • value_counts_metrics
  • null_value_counts_metrics
  • nan_value_counts_metrics
  • lower_bounds_metrics
  • upper_bounds_metrics

This is to keep backward compatibility as the existing metrics columns can not be changed.

The first four return Map<String, Long>. Key is the human-readable column name (dot separated for nested columns). The last two return Map<String, String>. Key is like above, Value is human-readable upper/lower bound.

Example:

  • value_counts_metrics = Map("mystruct.timestamp" => 1000)
  • upper_bounds_metrics = Map("mystruct.timestamp" => "1970-01-01T00:00:00.000002")

This makes Iceberg metadata tables is a bit closer to Trino, where the last two columns are <Long, String> (column id to human readable bound). It goes beyond and even resolves the column to make it readable.

Implementation detail: Not that we add new columns to files table, the rows of the table are no longer a 1-to-1 mapping with the "DataFile" java object getters, so we have to add code to handle column mapping in projection case.

szehon-ho avatar Jul 28 '22 21:07 szehon-ho

All Spark tests are updated/fixed now.

Fyi @RussellSpitzer @aokolnychyi @rdblue if you guys have time to leave some feedback. There are new tests added but its a bit big to show 'Files Changed': TestMetadataTableMetricsColumns.java

szehon-ho avatar Aug 02 '22 18:08 szehon-ho

It seems like a great idea to add readable metrics. It is hard to make sense of them otherwise.

@szehon-ho, what do you think about adding a single map column, let's say called readable_metrics, that will hold a mapping from a column name into a struct that would represent metrics? The type will be Map<String, StructType> and we will have individual struct fields for each type of metric.

We can then easily access them via SQL.

SELECT readable_metrics['col1'].lower_bound FROM db.t.files

I am okay with individual columns too but it seems a bit cleaner to just have one.

aokolnychyi avatar Aug 10 '22 04:08 aokolnychyi

Let me check in a bit.

aokolnychyi avatar Aug 12 '22 15:08 aokolnychyi

Let me take a look today.

aokolnychyi avatar Aug 22 '22 23:08 aokolnychyi

Let me take a look now.

aokolnychyi avatar Aug 26 '22 18:08 aokolnychyi

Added additional test, looks it is working even when readable_metric column is selected before other columns (spark somehow calls the rows in their original order)

szehon-ho avatar Aug 31 '22 23:08 szehon-ho

Really nice PR, thanks @szehon-ho and @aokolnychyi for the effort! When can we merge this? I think it is ready and has been two months since the last review, which will lead to more conflicts if leave it.

chenjunjiedada avatar Nov 02 '22 12:11 chenjunjiedada

updated and rebased, @RussellSpitzer if you have time to take a look as well

szehon-ho avatar Nov 04 '22 02:11 szehon-ho

Update. chatted offline with @RussellSpitzer will spend a few days if its possible to make the type dynamic struct instead of static map, to get the right types for lower, upper bounds.

szehon-ho avatar Nov 09 '22 08:11 szehon-ho

@RussellSpitzer i think it should be as we discussed now, the readable_metrics is a dynamic type of column metrics for all primitive columns by qualified name. Each column metric is a struct , of which upper/lower bounds is the original column type.

Added code to generate the dynamic schema and handle projection (previously it was a map , so no projection needed).

szehon-ho avatar Nov 15 '22 10:11 szehon-ho

Transitive error downloading, restarting

szehon-ho avatar Nov 15 '22 17:11 szehon-ho

@RussellSpitzer addressed the comments, thanks!

szehon-ho avatar Dec 02 '22 13:12 szehon-ho

Actually hold on a second, looking at a small refactor to make it more generic to add a readable_metric definition in future

szehon-ho avatar Dec 02 '22 14:12 szehon-ho

@RussellSpitzer should be good now for another look when you get a chance, thanks!

szehon-ho avatar Dec 02 '22 19:12 szehon-ho

Yep , test should be here: https://github.com/apache/iceberg/blob/6681dba9bc7dc0d793aa8de739d2b9962260b0ff/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestMetadataTableReadableMetrics.java

szehon-ho avatar Dec 02 '22 22:12 szehon-ho

Would love to see what is a good way to simplify it without breaking the checks. Currently compares every single field.

szehon-ho avatar Dec 02 '22 22:12 szehon-ho

Thanks @RussellSpitzer @aokolnychyi @chenjunjiedada for detailed reviews

szehon-ho avatar Dec 05 '22 18:12 szehon-ho

@szehon-ho @RussellSpitzer Is there any document about these readable metrics ? All these metrics are exposed using files metadata only ?

atifiu avatar Sep 17 '23 11:09 atifiu

Closes #4362

This adds following columns to all files tables:

  • readable_metrics, which is struct of:
  • column_sizes
  • value_counts
  • null_value_counts
  • nan_value_counts
  • lower_bounds
  • upper_bounds

These are then a map of column_name to value.

@szehon-ho Actual column names are without 's' in the end

atifiu avatar Sep 17 '23 11:09 atifiu