
Querying Hudi Table Created With Version 0.12.3 Not Working on Trino 430

Open Amar1404 opened this issue 2 years ago • 2 comments

Describe the problem you faced

I have a Hudi table created with version 0.12.3. When I try to query it using Trino, it is not able to even start reading the table. But when I create the same table with version 0.12.1, I am able to query it using Trino.

To Reproduce

Steps to reproduce the behavior:

  1. Trino EKS setup file: trino.txt
  2. Create the Hudi table on EMR with Hudi DeltaStreamer 0.12.3, using these JARs: https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.12.3/hudi-utilities-bundle_2.12-0.12.3.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.3-bundle_2.12/0.12.3/hudi-spark3.3-bundle_2.12-0.12.3.jar
  3. Hudi properties used:
       "hoodie.schema.on.read.enable": "true",
       "hoodie.cleaner.commits.retained": "3",
       "hoodie.datasource.write.reconcile.schema": "true",
       "hoodie.parquet.compression.codec": "zstd",
       "hoodie.delete.shuffle.parallelism": "200",
       "hoodie.parquet.max.file.size": "268435456",
       "hoodie.upsert.shuffle.parallelism": "200",
       "hoodie.datasource.hive_sync.support_timestamp": "true",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.insert.shuffle.parallelism": "200",
       "hoodie.parquet.small.file.limit": "134217728",
       "hoodie.bootstrap.parallelism": "200",
       "hoodie.embed.timeline.server": "true",
       "hoodie.bulkinsert.shuffle.parallelism": "200",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.filesystem.view.type": "EMBEDDED_KV_STORE",
       "hoodie.clean.max.commits": "4"
       hoodie.metadata.enable: true
       spark.hadoop.fs.s3.canned.acl: BucketOwnerFullControl
       hoodie.datasource.hive_sync.support_timestamp=true
  4. I am using Kafka as the source here and syncing the table to the Glue Catalog.
  5. When I run a simple query on Trino like "Select * from hudi_table", it is not able to load.
  6. With the same properties used to create the Hudi table with version 0.12.1, I am able to query it. https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.12.1/hudi-utilities-bundle_2.12-0.12.1.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.3-bundle_2.12/0.12.1/hudi-spark3.3-bundle_2.12-0.12.1.jar
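
The properties in step 3 are posted in three different syntaxes (JSON-style pairs, `key: value`, and `key=value`), and one key appears twice. For anyone trying to reproduce this, the following is a minimal Python sketch, not part of the original report, that normalizes such a paste into `key=value` lines and flags repeated keys. The names `parse_props` and `to_properties` are hypothetical helpers, and `RAW` is only an excerpt of the posted properties.

```python
import re

# Excerpt of the properties as posted in step 3 (three syntaxes mixed together).
RAW = (
    '"hoodie.schema.on.read.enable": "true" '
    '"hoodie.cleaner.commits.retained": "3", '
    '"hoodie.datasource.hive_sync.support_timestamp": "true", '
    'hoodie.metadata.enable: true '
    'spark.hadoop.fs.s3.canned.acl: BucketOwnerFullControl '
    'hoodie.datasource.hive_sync.support_timestamp=true'
)

# Matches "key": "value", key: value and key=value (quotes optional).
PAIR = re.compile(r'"?([A-Za-z_][\w.]*)"?\s*[:=]\s*"?([^",\s]+)"?')


def parse_props(raw):
    """Return (props, duplicates); a later value overrides an earlier one."""
    props = {}
    duplicates = []
    for key, value in PAIR.findall(raw):
        if key in props:
            duplicates.append(key)
        props[key] = value
    return props, duplicates


def to_properties(props):
    """Render the dict as sorted key=value lines (Java .properties style)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


if __name__ == "__main__":
    parsed, repeated = parse_props(RAW)
    print(to_properties(parsed))
    print("repeated keys:", repeated)
```

Writing the result to a single properties file makes it easier to confirm that the 0.12.1 and 0.12.3 runs really used identical settings.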

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.12.3

  • Spark version : 3.3

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :S3

  • Running on Docker? (yes/no) : no

  • Trino version : 430

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

Amar1404 Dec 02 '23 07:12

@Amar1404 Do you get any error when you query? @codope Do you have any insights on this?

ad1happy2go Dec 11 '23 12:12

Hi @ad1happy2go - I have found that the issue is in syncing the table to the catalog, since I am using the Glue Catalog. When I create the table using the HudiSyncTool class, the table does not work in Trino, but when I use AwsGlueCatalogSync it works fine. I am not sure what the difference between these two classes is.

Amar1404 Dec 18 '23 04:12
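
One way to see what differs between the two sync paths is to compare the Glue table definitions they produce. Below is a minimal Python sketch under stated assumptions: the two dicts stand in for the `Table` part of a Glue `get_table` response (for example from boto3's `glue.get_table(DatabaseName=..., Name=...)["Table"]`), and every value in them is a placeholder, not data from this issue. `flatten` and `diff_tables` are hypothetical helper names.

```python
def flatten(table, prefix=""):
    """Flatten nested dicts into {'A.B.C': value}; lists are kept as whole values."""
    flat = {}
    for key, value in table.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_tables(table_a, table_b):
    """Return {path: (value_in_a, value_in_b)} for every field that differs."""
    flat_a, flat_b = flatten(table_a), flatten(table_b)
    return {
        path: (flat_a.get(path), flat_b.get(path))
        for path in sorted(set(flat_a) | set(flat_b))
        if flat_a.get(path) != flat_b.get(path)
    }


# Placeholder table definitions (hypothetical values, for illustration only).
table_a = {
    "Name": "hudi_table",
    "StorageDescriptor": {
        "InputFormat": "placeholder.InputFormat",
        "SerdeInfo": {"SerializationLibrary": "placeholder.SerDe"},
    },
    "Parameters": {"example.key": "a"},
}
table_b = {
    "Name": "hudi_table",
    "StorageDescriptor": {
        "InputFormat": "placeholder.InputFormat",
        "SerdeInfo": {"SerializationLibrary": "placeholder.SerDe"},
    },
    "Parameters": {"example.key": "b", "only.in.b": "x"},
}

if __name__ == "__main__":
    for path, (left, right) in diff_tables(table_a, table_b).items():
        print(f"{path}: {left!r} != {right!r}")
```

Running this against the real definitions of the working and non-working tables would show whether the difference lies in the storage descriptor, the SerDe, or the table parameters.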

@Amar1404 Ideally, HiveSync should also delegate to AwsGlueCatalogSync if Glue is enabled for EMR, so it should not make any difference.

ad1happy2go Jan 31 '24 15:01
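
For readers who hit the same problem, here is a hedged Python sketch of how the sync tool can be selected explicitly when launching DeltaStreamer, so the two behaviors discussed in this thread can be compared side by side. It assumes the class names mentioned above refer to `org.apache.hudi.hive.HiveSyncTool` and `org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool`, and that the Glue sync class is available on the job's classpath. The helper name, bucket path, and table name are hypothetical, and a real job needs further arguments (source class, properties file, table type, and so on) that are omitted here.

```python
# Assumed fully qualified names for the classes discussed in the thread.
HIVE_SYNC_TOOL = "org.apache.hudi.hive.HiveSyncTool"
GLUE_SYNC_TOOL = "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"


def deltastreamer_sync_args(sync_tool_class, base_path, table_name):
    """Build a partial spark-submit argument list that pins the meta sync tool."""
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "hudi-utilities-bundle_2.12-0.12.3.jar",
        "--target-base-path", base_path,
        "--target-table", table_name,
        "--enable-sync",                         # turn on meta sync
        "--sync-tool-classes", sync_tool_class,  # which sync tool to run
    ]


if __name__ == "__main__":
    # Hypothetical S3 path and table name, for illustration only.
    args = deltastreamer_sync_args(
        GLUE_SYNC_TOOL, "s3://example-bucket/hudi_table", "hudi_table"
    )
    print(" ".join(args))
```

Pinning the sync tool class this way removes any dependence on whether the Hive sync path delegates to Glue in a given EMR setup.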