
Unable to read the Iceberg table in Athena that was converted from Hudi to Iceberg format using XTable

Open rangareddy opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Please describe the bug 🐞

Team, I have converted a Hudi table to an Iceberg table using XTable. When I query the table from Athena, I get the following error:

ICEBERG_BAD_DATA: Field last_modified_time's type INT64 in parquet file s3a://<table_name>/<partiton_name>/<parquet_file_name>.parquet is incompatible with type timestamp(6) with time zone defined in table schema This query ran against the "<database_name>" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 1f0401d0-584e-4eec-8a2d-9f719a85973c

Hudi Table Schema:

CREATE EXTERNAL TABLE `default.my_table`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `my_col` double, 
  `last_modified_time` bigint)
PARTITIONED BY ( 
  `partiton_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='false', 
  'path'='s3a://<bucket_name>/my_table') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<bucket_name>/my_table'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'hudi.metadata-listing-enabled'='FALSE', 
  'isRegisteredWithLakeFormation'='false', 
  'last_commit_completion_time_sync'='20241121011339000', 
  'last_commit_time_sync'='20241121011254282', 
  'last_modified_by'='hadoop', 
  'last_modified_time'='1732162935', 
  'spark.sql.create.version'='3.5.2-amzn-1', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}, {\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"my_col\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_modified_time\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"partiton_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 
  'spark.sql.sources.schema.partCol.0'='partiton_id', 
  'transient_lastDdlTime'='1732162935')

Are you willing to submit PR?

  • [ ] I am willing to submit a PR!
  • [ ] I am willing to submit a PR but need help getting started!

Code of Conduct

rangareddy avatar Nov 22 '24 10:11 rangareddy

@rangareddy what is the data type for the field in the Parquet file? I see that last_modified_time is listed as bigint in the DDL but as timestamp in the spark.sql schema property. In Hudi, you'd need to use a logical type for a timestamp field.

the-other-tim-brown avatar Nov 22 '24 22:11 the-other-tim-brown

@rangareddy since you're testing with Athena, you can ignore the spark.sql.* table properties. The problem is that the Parquet file stores the column as a plain INT64, while the Iceberg table schema defines it as timestamp(6) with time zone, which violates Iceberg's type checks. You could look for a config to bypass this validation on the Iceberg side, but have you tried creating the source table with a proper timestamp type for last_modified_time? That should work.

xushiyan avatar Nov 23 '24 03:11 xushiyan