About the support for Schema Evolution (column rename)
Hi there, I tested the onetable tool by creating an Iceberg table and then running a column rename.
spark-sql> CREATE TABLE hadoop_prod.repro_rename ( id bigint, data string) using iceberg;
Response code
Time taken: 2.269 seconds
spark-sql> insert into hadoop_prod.repro_rename values (1, "abc");
Response code
Time taken: 8.093 seconds
spark-sql> select * from hadoop_prod.repro_rename;
id data
1 abc
spark-sql> alter table hadoop_prod.repro_rename rename column id to new_id ;
Time taken: 2.255 seconds
spark-sql> select * from hadoop_prod.repro_rename;
new_id data
1 abc
I then ran onetable to convert the Iceberg table to Hudi and Delta, but the column rename doesn't seem to be captured in the converted metadata.
Issue 1: the converted tables still expose the old schema (id instead of new_id).
HUDI:
>>> df = spark.read.format("hudi").options(**hudi_options).load("MY_PATH/repro_rename")
>>> df.show(truncate=False)
+---+----+
|id |data|
+---+----+
|1 |abc |
+---+----+
DELTA:
spark-sql> select * FROM delta.`MY_PATH/repro_rename` ;
id data
1 abc
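One way to double-check what the converted Delta table advertises, independent of Spark, is to read the newest metaData action straight from the JSON commit files. The following is a minimal sketch, not part of onetable: the helper name latest_delta_metadata is hypothetical, and it assumes the table has no checkpoint that supersedes the JSON commits.

```python
import json
from pathlib import Path


def latest_delta_metadata(table_path):
    """Return (column_names, configuration) from the newest metaData action.

    Hypothetical helper for inspection only. It scans the JSON commit files
    under _delta_log and ignores checkpoint Parquet files.
    """
    log_dir = Path(table_path) / "_delta_log"
    names, config = None, None
    # Commit file names are zero-padded, so a lexical sort is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            meta = action.get("metaData")
            if meta is None:
                continue
            # schemaString is itself a JSON document describing the struct.
            schema = json.loads(meta["schemaString"])
            names = [field["name"] for field in schema["fields"]]
            config = meta.get("configuration", {})
    return names, config


# Example call (path as used in this report):
# names, config = latest_delta_metadata("MY_PATH/repro_rename")
```

If the rename had been carried over, the returned names would contain new_id rather than id.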
Issue 2: Iceberg's column rename is built on top of field ids, but I don't see the Delta/Hudi equivalents in the converted metadata.
For Delta: "delta.columnMapping.mode":"id" and "delta.columnMapping.maxColumnId":"4" are missing from the Delta log. See https://docs.databricks.com/en/delta/delta-column-mapping.html for the implementation of 'delta.columnMapping.mode' = 'id'.
For Hudi: the Hudi commit log doesn’t have id and max_column_id populated (fields from https://github.com/apache/hudi/pull/4910/files).
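To make the Delta side concrete, here is a rough sketch of the metadata I would expect when Iceberg field ids are carried over into Delta column mapping. It is an illustration only, not onetable code: the function name delta_column_mapping is hypothetical, the field ids 1 and 2 are made-up values rather than ones read from the real table, and reusing the current column name as the physical name is an assumption of the sketch. The metadata keys delta.columnMapping.id and delta.columnMapping.physicalName are the ones the Delta protocol defines for column mapping.

```python
import json


def delta_column_mapping(fields):
    """Build a Delta schemaString and table configuration with id column mapping.

    `fields` is a list of (field_id, column_name, delta_type) tuples.
    Hypothetical helper: the physical name is simply set to the current
    column name, which is an assumption of this sketch.
    """
    delta_fields = []
    for field_id, name, delta_type in fields:
        delta_fields.append({
            "name": name,
            "type": delta_type,
            "nullable": True,
            "metadata": {
                # Per-column keys used by Delta column mapping.
                "delta.columnMapping.id": field_id,
                "delta.columnMapping.physicalName": name,
            },
        })
    schema_string = json.dumps({"type": "struct", "fields": delta_fields})
    configuration = {
        "delta.columnMapping.mode": "id",
        # Highest id handed out so far, stored as a string like other properties.
        "delta.columnMapping.maxColumnId": str(max(fid for fid, _, _ in fields)),
    }
    return schema_string, configuration


# Illustrative ids only; the real values would come from the Iceberg schema.
schema_string, configuration = delta_column_mapping(
    [(1, "new_id", "long"), (2, "data", "string")]
)
print(configuration)
```

Because the id stays fixed while the name changes, a rename only touches the "name" entry in the schema, and existing data files remain readable. Note that the Delta protocol also requires reader version 2 and writer version 5 for column mapping, so the protocol action would need to change as well.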
That would be a great feature to have. Since Iceberg and Delta both support column renames, we can start by supporting those. @huan233usc Do you want to pick that up?
@vamshigv If someone can guide me a bit, I am happy to pick it up.