
File not found while using metadata table for insert_overwrite table

Open · opened by Shubham21k · 7 comments

We are incrementally writing to a Hudi table with insert_overwrite operations. Recently, we enabled the Hudi metadata table for these tables. However, after a few days we started to encounter FileNotFoundException while reading these tables from Athena (with metadata listing enabled). Upon further investigation, we observed that the metadata table contained references to older files that had been cleaned up by the cleaner and are no longer available.

Steps to reproduce the behavior:

  1. Create a simple DataFrame and write it incrementally to a Hudi table with these properties:
hoodie.datasource.meta.sync.enable=true 
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool 
hoodie.write.markers.type=DIRECT 
hoodie.metadata.enable=true 
hoodie.datasource.write.operation=insert_overwrite 
hoodie.datasource.write.partitionpath.field=cs_load_hr 
hoodie.datasource.hive_sync.partition_fields=cs_load_hr 
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator 
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING 
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd/HH 
hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor 
hoodie.parquet.compression.codec=snappy 
hoodie.table.services.enabled=true 
hoodie.rollback.using.markers=false 
hoodie.commits.archival.batch=30 
hoodie.archive.delete.parallelism=500 
hoodie.index.type=SIMPLE 
hoodie.clean.allow.multiple=false 
hoodie.clean.async=true 
hoodie.clean.automatic=true 
hoodie.cleaner.policy=KEEP_LATEST_COMMITS 
hoodie.cleaner.commits.retained=3 
hoodie.cleaner.parallelism=500 
hoodie.cleaner.incremental.mode=true 
hoodie.clean.max.commits=8 
hoodie.archive.async=true 
hoodie.archive.automatic=true 
hoodie.archive.merge.enable=true 
hoodie.archive.merge.files.batch.size=60 
hoodie.keep.max.commits=10 
hoodie.keep.min.commits=5

df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)

  1. After a few incremental writes, some of the base files get replaced, but the metadata table does not get updated properly; it continues to hold pointers to the old files as well.
  2. If you then read the table using Spark or Athena with metadata listing enabled, you will get a FileNotFoundException (see the read sketch after this list). With metadata listing disabled on the read side, there is no error and reads work fine.
  3. Note: we have observed this issue only for insert_overwrite operations; for upsert tables the metadata gets updated correctly.
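
To make step 2 concrete, here is a minimal read-side sketch in Spark (Scala), reusing hudiOutputTablePath from the write above; hoodie.metadata.enable is the read option that toggles metadata-based file listing:

// Reading with metadata-based listing enabled reproduces the failure,
// because the metadata table still references files the cleaner has deleted.
val withMetadataListing = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "true")
  .load(hudiOutputTablePath)
withMetadataListing.count() // fails with FileNotFoundException

// Falling back to direct storage listing avoids the stale references.
val withDirectListing = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "false")
  .load(hudiOutputTablePath)
withDirectListing.count() // succeeds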

Expected behavior

It is expected that the Hudi metadata table gets updated correctly, so that reads with metadata listing enabled do not reference cleaned-up files.

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.2.1

  • Hive version : NA

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) :

Additional context

The timeline of the affected tables also contains replacecommit instants (which are not present in the case of the upsert tables).

$ aws s3 ls s3://tmp-data/investments_ctr_tbl/.hoodie/
                           PRE .aux/
                           PRE archived/
                           PRE metadata/
2023-12-08 13:32:17          0 .aux_$folder$
2023-12-08 13:32:17          0 .schema_$folder$
2023-12-08 13:32:17          0 .temp_$folder$
2023-12-14 22:17:18       4678 20231214221641350.clean
2023-12-14 22:17:11       3227 20231214221641350.clean.inflight
2023-12-14 22:17:10       3227 20231214221641350.clean.requested
2023-12-22 21:50:54       4439 20231222214849300.clean
2023-12-22 21:50:45       4337 20231222214849300.clean.inflight
2023-12-22 21:50:45       4337 20231222214849300.clean.requested
2023-12-30 21:51:16       4439 20231230214431936.clean
2023-12-30 21:51:07       4337 20231230214431936.clean.inflight
2023-12-30 21:51:07       4337 20231230214431936.clean.requested
2024-01-07 21:53:30       4439 20240107215204594.clean
2024-01-07 21:53:23       4337 20240107215204594.clean.inflight
2024-01-07 21:53:22       4337 20240107215204594.clean.requested
2024-01-15 21:55:00       4439 20240115215112126.clean
2024-01-15 21:54:52       4337 20240115215112126.clean.inflight
2024-01-15 21:54:52       4337 20240115215112126.clean.requested
2024-01-23 21:46:53       4439 20240123214442067.clean
2024-01-23 21:46:45       4337 20240123214442067.clean.inflight
2024-01-23 21:46:45       4337 20240123214442067.clean.requested
2024-01-27 21:42:46       4107 20240127214137934.replacecommit
2024-01-27 21:42:22       2221 20240127214137934.replacecommit.inflight
2024-01-27 21:41:42       2522 20240127214137934.replacecommit.requested
2024-01-28 21:47:38       4107 20240128214645441.replacecommit
2024-01-28 21:47:15       2221 20240128214645441.replacecommit.inflight
2024-01-28 21:46:50       2522 20240128214645441.replacecommit.requested
2024-01-29 21:48:52       4107 20240129214748785.replacecommit
2024-01-29 21:48:23       2221 20240129214748785.replacecommit.inflight
2024-01-29 21:47:55       2522 20240129214748785.replacecommit.requested
2024-01-30 21:47:51       4107 20240130214652235.replacecommit
2024-01-30 21:47:26       2221 20240130214652235.replacecommit.inflight
2024-01-30 21:46:58       2522 20240130214652235.replacecommit.requested
2024-01-31 21:46:06       4107 20240131214409159.replacecommit
2024-01-31 21:45:47       2221 20240131214409159.replacecommit.inflight
2024-01-31 21:44:16       2522 20240131214409159.replacecommit.requested
2024-01-31 21:45:57       4439 20240131214417071.clean
2024-01-31 21:45:49       4337 20240131214417071.clean.inflight
2024-01-31 21:45:49       4337 20240131214417071.clean.requested
2024-02-01 21:46:04       4107 20240201214350334.replacecommit
2024-02-01 21:45:35       2221 20240201214350334.replacecommit.inflight
2024-02-01 21:43:59       2522 20240201214350334.replacecommit.requested
2024-02-02 21:48:21       4107 20240202214718803.replacecommit
2024-02-02 21:47:56       2221 20240202214718803.replacecommit.inflight
2024-02-02 21:47:26       2522 20240202214718803.replacecommit.requested
2024-02-03 21:52:09       4107 20240203215112346.replacecommit
2024-02-03 21:51:46       2221 20240203215112346.replacecommit.inflight
2024-02-03 21:51:18       2522 20240203215112346.replacecommit.requested
2024-02-04 21:49:18       4107 20240204214759805.replacecommit
2024-02-04 21:48:40       2221 20240204214759805.replacecommit.inflight
2024-02-04 21:48:06       2522 20240204214759805.replacecommit.requested
2024-02-05 14:13:44      13208 20240205141148519.replacecommit
2024-02-05 14:13:21       8512 20240205141148519.replacecommit.inflight
2024-02-05 14:11:52       2522 20240205141148519.replacecommit.requested
2023-12-08 13:32:17          0 archived_$folder$
2023-12-08 13:32:25       1130 hoodie.properties
2023-12-08 13:32:21          0 metadata_$folder$

Also, here is the output of the metadata validate-files command run from the hudi-cli on the affected table:
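
(For anyone who wants to run the same check: a rough sketch of the hudi-cli invocation, with the table path taken from the listing above; the exact prompt and flags may differ across versions.)

hudi-> connect --path s3://tmp-data/investments_ctr_tbl
hudi:investments_ctr_tbl-> metadata validate-files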

Shubham21k, Feb 05 '24

@Shubham21k Can you try with 0.14.1 once? Also, async table services don't work with the datasource writer.

I tried to reproduce this but was unable to. Can you check whether you can enhance the script and reproduce it?

Code here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f/edit

ad1happy2go, Feb 06 '24

@Shubham21k What queries are you running on this data? Does select * work? For point-in-time queries, this error is expected in case the commit is not yet archived but already cleaned.
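
(A point-in-time query here means a time-travel read, roughly like the sketch below; the instant is just one of the replacecommits from the timeline above, and the path is illustrative. If the files written by that instant have already been cleaned, the FileNotFoundException is expected.)

// Hypothetical time-travel read as of an older replacecommit instant;
// expected to fail if the cleaner has already removed that instant's files.
val asOf = spark.read
  .format("hudi")
  .option("as.of.instant", "20240127214137934")
  .load("s3://tmp-data/investments_ctr_tbl")
asOf.count()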

ad1happy2go, Feb 06 '24

@ad1happy2go

  1. All the select queries are failing due to this error. As suggested, we will try 0.14.1 and check whether this gets fixed.
  2. "async services doesn't work with Datasource writer" - can you elaborate? We have observed the cleaner and archival running with these configurations.

We are not able to open the code link you shared for reproducing the error.

Shubham21k, Feb 07 '24

Is there a workaround for this issue? We are facing a similar issue as well.

shravanak, Feb 08 '24

@Shubham21k Code link here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f

Async services are meant for streaming workloads like Hudi Streamer, so that table services can run asynchronously and don't block ingestion of the next micro-batch. With datasource (batch) writers they don't make sense, and inline table services will be kicked in instead.
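
(If inline table services are what you actually want with the datasource writer, here is a rough sketch reusing the hudiOptions map from the reproduction steps and simply turning the async flags off; the rest of the options stay as listed there.)

// Run cleaning and archival inline with each batch write instead of asynchronously.
val inlineServiceOptions = hudiOptions ++ Map(
  "hoodie.clean.automatic"   -> "true",   // clean inline after each commit
  "hoodie.clean.async"       -> "false",
  "hoodie.archive.automatic" -> "true",   // archive inline as well
  "hoodie.archive.async"     -> "false"
)

df.write.format("org.apache.hudi").options(inlineServiceOptions).mode(SaveMode.Append).save(hudiOutputTablePath)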

@shravanak Which Hudi version are you using? Are you also using insert_overwrite? Can you elaborate?

ad1happy2go, Feb 08 '24

We are using insert write mode with Hudi 0.14.0. I think the missing file or partition it is referring to might be from before we upgraded to 0.14.0; we were previously on 0.12.2.

shravanak, Feb 08 '24

@shravanak That may well be the cause. Did you face this issue with other tables as well?

ad1happy2go, Feb 14 '24

@shravanak Are you still facing this issue? Let us know in case you need help here.

ad1happy2go, Mar 06 '24

Hey @ad1happy2go: if this turns out to be an MDT (metadata table) data consistency issue, do keep me posted. Thanks.

nsivabalan, Apr 09 '24