Unable to read the Iceberg table in Athena that was converted from Hudi to Iceberg format using XTable
### Search before asking
- [X] I had searched in the issues and found no similar issues.
### Please describe the bug 🐞
Team, I converted a Hudi table to an Iceberg table using XTable. When I query the table from Athena, I get the following error:

```
ICEBERG_BAD_DATA: Field last_modified_time's type INT64 in parquet file s3a://
```
Hudi table schema:

```sql
CREATE EXTERNAL TABLE `default.my_table`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `my_col` double,
  `last_modified_time` bigint)
PARTITIONED BY (
  `partiton_id` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'hoodie.query.as.ro.table'='false',
  'path'='s3a://<bucket_name>/my_table')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<bucket_name>/my_table'
TBLPROPERTIES (
  'bucketing_version'='2',
  'hudi.metadata-listing-enabled'='FALSE',
  'isRegisteredWithLakeFormation'='false',
  'last_commit_completion_time_sync'='20241121011339000',
  'last_commit_time_sync'='20241121011254282',
  'last_modified_by'='hadoop',
  'last_modified_time'='1732162935',
  'spark.sql.create.version'='3.5.2-amzn-1',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numPartCols'='1',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"my_col\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_modified_time\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"partiton_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'spark.sql.sources.schema.partCol.0'='partiton_id',
  'transient_lastDdlTime'='1732162935')
```
### Are you willing to submit PR?
- [ ] I am willing to submit a PR!
- [ ] I am willing to submit a PR but need help getting started!
### Code of Conduct
- [X] I agree to follow this project's Code of Conduct
---

@rangareddy what is the data type for the field in the parquet file? I see that `last_modified_time` is listed as `bigint` and also as `timestamp` in the DDL. In Hudi, you'd need to use a logical type for a timestamp field.
@rangareddy since you're testing with Athena, you can ignore those `spark.sql.*` table properties.

The problem is that the parquet file contains a timestamp-with-timezone type, but the DDL declares it as bigint, which violates some Iceberg checks. You could see whether Iceberg has any config to bypass this validation. Also, have you tried creating the table with the `timestamp` type for `last_modified_time`? That should work.