[SUPPORT] Issues w/ incremental query in MOR table
Describe the problem you faced
A parquet file cannot be found when using Spark to incrementally read the MOR table. No write transactions are running at the time of the read.
Steps to reproduce the behavior:
1. Write one commit to a MOR table with the options "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS", "hoodie.cleaner.fileversions.retained": 24, "hoodie.compact.inline": "true", "hoodie.compact.inline.max.delta.commits": 10, "hoodie.keep.min.commits": 99, "hoodie.keep.max.commits": 100 (see the write sketch after this list).
2. After the first batch is committed, the table can be read incrementally.
3. After writing a few more batches, the incremental read fails, while the read-optimized view and the snapshot view can still be read normally.
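For reference, a minimal sketch of how these writer options might be passed from PySpark. The table name, record key, precombine, and partition fields are assumptions (only the cleaner, compaction, and archival settings are taken from the report), and df stands for the batch being written:

# Minimal write sketch; key/precombine fields below are hypothetical.
hudi_options = {
    "hoodie.table.name": "hudi_mor",                       # assumed from the table path
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",       # hypothetical key field
    "hoodie.datasource.write.precombine.field": "ts",      # hypothetical precombine field
    "hoodie.datasource.write.partitionpath.field": "par",  # partition column seen in the stacktrace path
    # Settings from the report:
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": 24,
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": 10,
    "hoodie.keep.min.commits": 99,
    "hoodie.keep.max.commits": 100,
}
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/user/hive/warehouse/test.db/hudi_mor"))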
Expected behavior
The incremental query should keep working. Instead, after the MOR table has been updated for several commits, the incremental query fails to find a parquet file, while the read-optimized and real-time (snapshot) views can still be accessed normally.
Environment Description
- Hudi version : 0.7.0
- Spark version : 2.4.0
- Hive version : 2.1.1
- Hadoop version : 3.0.0
- Storage (HDFS/S3/GCS..) : HDFS
- Running on Docker? (yes/no) : no
Stacktrace
>>> op = {'hoodie.datasource.query.type': 'incremental','hoodie.datasource.read.begin.instanttime': '0'}
>>> spark.read.format("hudi").options(**op).load("/user/hive/warehouse/test.db/hudi_mor").count()
[Stage 19:> (1 + 20) / 25839]22/08/06 01:02:27 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 19.0 (TID 12328, xx.com, executor 37): java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/user/hive/warehouse/test.db/hudi_mor/par=4b/166dc4dd-fe33-47fe-8b1b-23b834a1c3e4-0_4846-55-139569_20220805223417.parquet
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1499)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1492)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1507)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:413)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
at org.apache.hudi.HoodieMergeOnReadRDD.read(HoodieMergeOnReadRDD.scala:98)
at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
commits show
╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20220806004543 │ 2.6 GB │ 0 │ 6472 │ 256 │ 13459201 │ 7297 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805232843 │ 2.7 GB │ 0 │ 7499 │ 256 │ 13455632 │ 8771 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805223417 │ 256.1 GB │ 0 │ 12910 │ 256 │ 1322056815 │ 174778 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805222939 │ 2.7 GB │ 0 │ 7809 │ 256 │ 13446221 │ 9105 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805212946 │ 2.6 GB │ 0 │ 3534 │ 256 │ 13422116 │ 3580 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805202932 │ 2.6 GB │ 0 │ 3599 │ 256 │ 13418043 │ 3670 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805192931 │ 2.6 GB │ 0 │ 3210 │ 256 │ 13412437 │ 3192 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805183958 │ 2.6 GB │ 0 │ 5240 │ 256 │ 13418101 │ 5666 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805172927 │ 2.6 GB │ 0 │ 6840 │ 256 │ 13408492 │ 7790 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805171751 │ 2.6 GB │ 0 │ 4237 │ 256 │ 13377870 │ 4469 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805170955 │ 2.6 GB │ 0 │ 4368 │ 256 │ 13359451 │ 4600 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805170021 │ 2.6 GB │ 0 │ 8856 │ 256 │ 13364422 │ 10938 │ 0 ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220805144130 │ 511.7 GB │ 25839 │ 0 │ 256 │ 2632982405 │ 0 │ 0 ║
╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
cleans show
╔═══════════╤═════════════════════════╤═════════════════════╤══════════════════╗
║ CleanTime │ EarliestCommandRetained │ Total Files Deleted │ Total Time Taken ║
╠═══════════╧═════════════════════════╧═════════════════════╧══════════════════╣
║ (empty) ║
╚══════════════════════════════════════════════════════════════════════════════╝
compactions show all
╔═════════════════════════╤═══════════╤═══════════════════════════════╗
║ Compaction Instant Time │ State │ Total FileIds to be Compacted ║
╠═════════════════════════╪═══════════╪═══════════════════════════════╣
║ 20220805223417 │ COMPLETED │ 12910 ║
╚═════════════════════════╧═══════════╧═══════════════════════════════╝
What is certain is that no write transactions are running during the incremental reads, so why do I get such an error?
There could be issues with MOR incremental query in Hudi 0.7.0. Since then MOR incremental reads have been improved. Have you tried Hudi 0.11.1 or the latest master to see if the problem still exists in your case?
There are some known limitations w/ incremental query. For e.g., there is some interplay b/w the cleaner and incremental query: if the cleaner has cleaned up the data file pertaining to commit Cn, and you trigger an incremental query w/ Cn, you may see a FileNotFoundException. You may have to relax the cleaner configs if you wish to do incremental reads over older commits.
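For illustration, a sketch of working around this by choosing a begin instant the cleaner still retains; '20220805232843' is one of the recent commits from `commits show` above, used here purely as an example:

# Restrict the incremental window to commits whose file versions still exist.
# Beginning from '0' (as in the failing query above) spans the full history,
# so the file listing can resolve to parquet files the cleaner has deleted.
incr_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20220805232843",
}
cnt = (spark.read.format("hudi")
         .options(**incr_opts)
         .load("/user/hive/warehouse/test.db/hudi_mor")
         .count())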
@15663671003 : gentle ping. do you have any more specific questions for us?
@nsivabalan After triggering compaction and clean, the incremental query loses some data. I generally understand this phenomenon, but I want to know whether the cause of the lost data is the archiver or the cleaner. Should I increase the value of "hoodie.keep.min.commits", of "hoodie.cleaner.hours.retained", or of both? Please help me.
@15663671003 This is due to the cleaner. I would suggest retaining more commits, which can be achieved by increasing the value of both the configs you mentioned above.
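A sketch of what relaxing retention could look like; the values are illustrative, not tuned recommendations. Note that "hoodie.cleaner.hours.retained" only takes effect under the KEEP_LATEST_BY_HOURS cleaner policy (which assumes a Hudi version that supports it), and "hoodie.keep.max.commits" must stay greater than "hoodie.keep.min.commits":

relaxed_opts = {
    # Either keep more file versions under the current policy...
    # "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    # "hoodie.cleaner.fileversions.retained": 48,   # was 24
    # ...or switch to time-based retention:
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": 48,   # illustrative: keep ~2 days of file slices
    # Archive later as well, so older commit metadata survives longer:
    "hoodie.keep.min.commits": 150,        # was 99
    "hoodie.keep.max.commits": 151,        # was 100; must exceed min.commits
}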
@15663671003 Any update on the issue?
Will go ahead and close out the issue. Please do file a new issue if the above suggestion does not work. Thanks!