Querying Hudi Table Created With Version 0.12.3 Not Working on Trino 430
Describe the problem you faced
I have a Hudi table created with version 0.12.3. When I try to query it using Trino, it is not even able to start reading the table. But when I create the same table with version 0.12.1, I am able to query it using Trino.
To Reproduce
Steps to reproduce the behavior:
- Trino EKS setup file: trino.txt
- Create the Hudi table on EMR with Hudi DeltaStreamer 0.12.3, using these JARs: https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.12.3/hudi-utilities-bundle_2.12-0.12.3.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.3-bundle_2.12/0.12.3/hudi-spark3.3-bundle_2.12-0.12.3.jar
- Hudi properties used:
  "hoodie.schema.on.read.enable": "true",
  "hoodie.cleaner.commits.retained": "3",
  "hoodie.datasource.write.reconcile.schema": "true",
  "hoodie.parquet.compression.codec": "zstd",
  "hoodie.delete.shuffle.parallelism": "200",
  "hoodie.parquet.max.file.size": "268435456",
  "hoodie.upsert.shuffle.parallelism": "200",
  "hoodie.datasource.hive_sync.support_timestamp": "true",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.insert.shuffle.parallelism": "200",
  "hoodie.parquet.small.file.limit": "134217728",
  "hoodie.bootstrap.parallelism": "200",
  "hoodie.embed.timeline.server": "true",
  "hoodie.bulkinsert.shuffle.parallelism": "200",
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.filesystem.view.type": "EMBEDDED_KV_STORE",
  "hoodie.clean.max.commits": "4"
  hoodie.metadata.enable: true
  spark.hadoop.fs.s3.canned.acl: BucketOwnerFullControl
  hoodie.datasource.hive_sync.support_timestamp=true
- I am using Kafka as the source and syncing the table to the Glue Catalog.
- When I run a simple query on Trino like "select * from hudi_table", it is not able to load.
- With the same properties used to create the Hudi table with version 0.12.1, I am able to query it. https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.12.1/hudi-utilities-bundle_2.12-0.12.1.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.3-bundle_2.12/0.12.1/hudi-spark3.3-bundle_2.12-0.12.1.jar
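For reference, properties like the ones listed above are typically passed to DeltaStreamer as repeated `--hoodie-conf key=value` arguments. The Python sketch below assembles such a command under stated assumptions: only a subset of the properties is shown, the base path and table name are placeholders rather than values from this report, and source-specific options (Kafka source class, schema provider) are omitted because the report does not state them.

```python
# Subset of the properties listed in the reproduction steps.
HUDI_PROPS = {
    "hoodie.schema.on.read.enable": "true",
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.metadata.enable": "true",
}


def hoodie_conf_args(props):
    """Turn a property dict into repeated `--hoodie-conf key=value` arguments."""
    args = []
    for key in sorted(props):
        args += ["--hoodie-conf", f"{key}={props[key]}"]
    return args


def deltastreamer_command(utilities_jar, base_path, table_name, props):
    """Build a spark-submit argument list for HoodieDeltaStreamer.

    Source options are intentionally left out; add them for a real run.
    """
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        utilities_jar,
        "--target-base-path", base_path,
        "--target-table", table_name,
    ] + hoodie_conf_args(props)


if __name__ == "__main__":
    cmd = deltastreamer_command(
        "hudi-utilities-bundle_2.12-0.12.3.jar",  # utilities bundle named in the report
        "s3://example-bucket/hudi_table",          # placeholder path (assumption)
        "hudi_table",                              # placeholder table name (assumption)
        HUDI_PROPS,
    )
    print(" ".join(cmd))
```

Swapping only the jar for the 0.12.1 bundle while keeping the property dict fixed gives a like-for-like comparison between the two versions.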
Environment Description
- Hudi version : 0.12.3
- Spark version : 3.3
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
- Trino version : 430
@Amar1404 Do you get any error when you run the query? @codope Do you have any insights on this?
Hi @ad1happy2go - I have found that the issue is in syncing the table to the catalog, since I am using the Glue Catalog. When I create the table using the HudiSyncTool class, the table does not work in Trino. But when I use AwsGlueCatalogSync, it works fine. I am not sure what the difference between these two classes is.
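One way to see what differs between the two sync paths is to compare the Glue table definitions they produce. The following is a minimal Python sketch, not a Hudi utility: `flatten` and `diff_tables` are hypothetical helpers, and `fetch_glue_table` assumes boto3 is installed with AWS credentials configured.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a {dotted.path: value} mapping."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def diff_tables(left, right, ignore=()):
    """Return {path: (left_value, right_value)} for every differing field.

    A field missing on one side is reported with None for that side.
    """
    a, b = flatten(left), flatten(right)
    diffs = {}
    for path in sorted(set(a) | set(b)):
        if path in ignore:
            continue
        if a.get(path) != b.get(path):
            diffs[path] = (a.get(path), b.get(path))
    return diffs


def fetch_glue_table(database, table):
    """Fetch a table definition from the Glue Data Catalog (needs boto3)."""
    import boto3  # imported lazily so the diff helpers work without it

    response = boto3.client("glue").get_table(DatabaseName=database, Name=table)
    return response["Table"]
```

A comparison could then be run as `diff_tables(fetch_glue_table("db", "table_a"), fetch_glue_table("db", "table_b"), ignore=("CreateTime", "UpdateTime", "Name"))`, where the database and table names are placeholders. Differences in `StorageDescriptor.InputFormat`, the SerDe settings, or the table `Parameters` are the first places to look.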
@Amar1404 Ideally, HiveSync should also delegate to AwsGlueCatalogSync if Glue is enabled for EMR, so it should not make any difference.
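To take delegation out of the picture while debugging, the sync tool can be named explicitly in the DeltaStreamer arguments. This is a minimal sketch, assuming DeltaStreamer's `--enable-sync` and `--sync-tool-classes` options and the fully qualified class names `org.apache.hudi.hive.HiveSyncTool` and `org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool`. The thread uses shortened names, so this mapping is an assumption to verify against the 0.12.3 bundle, and the chosen class must be on the classpath.

```python
# Assumed fully qualified names for the tools mentioned in the thread.
SYNC_TOOLS = {
    "hive": "org.apache.hudi.hive.HiveSyncTool",
    "glue": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
}


def sync_args(tool):
    """Return DeltaStreamer arguments that enable sync with one explicit tool."""
    if tool not in SYNC_TOOLS:
        raise ValueError(f"unknown sync tool: {tool!r}")
    return ["--enable-sync", "--sync-tool-classes", SYNC_TOOLS[tool]]


if __name__ == "__main__":
    # Arguments to append to an existing DeltaStreamer command line.
    print(" ".join(sync_args("glue")))
```

Running the same ingestion once with each tool and comparing the resulting catalog entries would show whether the two paths really behave the same.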