No data is inserted when an ETL job joins several Hudi tables
An ETL job runs every hour and does an INSERT OVERWRITE into one table from the result of a join across several Hudi tables. About once a week, the run inserts no data.
Environment Description
Hudi version : 0.9.1
Spark version : 3.0.1
Hive version : 3
Hadoop version : 3.2.2
Storage (HDFS/S3/GCS..) : s3
Running on Docker? (yes/no) : no
What can be the reason? Is there any way to debug this kind of issue, or to get more metrics for it?
Just to clarify: you have one ETL table that is fully refreshed from a join across multiple other Hudi tables, and about once a week that table ends up loaded with no data.
To debug this, when it happens, check whether the join itself returns any data. You can also use point-in-time (time travel) queries to read the source tables exactly as they were at that time.
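A minimal sketch of that check in PySpark, assuming the Hudi version in use supports the `as.of.instant` read option (documented for time travel reads from 0.9.0 onward). The helper names (`log_ts_to_instant`, `read_as_of`, `count_join_as_of`), the table paths, and the join key are hypothetical placeholders, not part of the original job. Hudi instants follow the writer's timezone, so the converted timestamp may need an offset.

```python
from datetime import datetime


def log_ts_to_instant(log_ts: str) -> str:
    """Convert a Spark log timestamp (yy/MM/dd HH:mm:ss) into a
    yyyyMMddHHmmss string, the format Hudi uses for instant times."""
    parsed = datetime.strptime(log_ts, "%y/%m/%d %H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")


def read_as_of(spark, base_path: str, instant: str):
    """Time travel read: load a Hudi table as of the given instant.
    `spark` is an existing SparkSession; `base_path` is a placeholder."""
    return (
        spark.read.format("hudi")
        .option("as.of.instant", instant)
        .load(base_path)
    )


def count_join_as_of(spark, left_path: str, right_path: str, key: str, instant: str) -> int:
    """Re-run a (simplified, hypothetical) two-table join against the
    snapshots at `instant` and return how many rows it produces."""
    left = read_as_of(spark, left_path, instant)
    right = read_as_of(spark, right_path, instant)
    return left.join(right, on=key, how="inner").count()
```

For example, `count_join_as_of(spark, path_a, path_b, "id", log_ts_to_instant("23/12/19 09:23:28"))` would show whether the join was already empty at the time of the failed run, or whether the data was there and something else went wrong in the write.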
Could it be related to the warning below?
23/12/19 09:23:28 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
23/12/19 09:23:28 WARN InMemoryFileIndex: The directory xxx/testing/2681717c-28d2-4f56-9664-4037cbe67c9b-0_2-221-9002_20231219071501.parquet was not found. Was it deleted very recently?
23/12/19 09:23:28 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 930 bytes result sent to driver
23/12/19 09:23:28 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 5
23/12/19 09:23:28 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
23/12/19 09:23:28 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 906 bytes result sent to driver
23/12/19 09:23:28 WARN InMemoryFileIndex: The directory xxx/testing/2681717c-28d2-4f56-9664-4037cbe67c9b-1_2-221-9002_20231219071501.parquet was not found. Was it deleted very recently?
23/12/19 09:23:28 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 930 bytes result sent to driver
23/12/19 09:23:28 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 6
23/12/19 09:23:28 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
23/12/19 09:23:28 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 7
23/12/19 09:23:28 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
23/12/19 09:23:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from xxx/testing
23/12/19 09:23:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from xxx/testing
23/12/19 09:23:28 INFO HoodieTableConfig: Loading table properties from xxx/testing/.hoodie/hoodie.properties
@njalan I don't think this warning is related or that it can cause the issue. You may be seeing it because another process is updating the source table at the same time. Is that the case?
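One way to look into the concurrent-writer question is to pull the missing files out of the driver or executor log and compare when each file was written with when the read failed. The sketch below relies on Hudi's base file naming, `<fileId>_<writeToken>_<instantTime>.parquet`. The function name `missing_base_files` and the returned dictionary keys are hypothetical, and the age is only approximate if the log and the Hudi timeline use different timezones.

```python
import re
from datetime import datetime

# Matches the InMemoryFileIndex warning quoted in the log above.
MISSING_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) WARN InMemoryFileIndex: "
    r"The directory (?P<path>\S+\.parquet) was not found"
)
# Hudi base file name: <fileId>_<writeToken>_<instantTime>.parquet
NAME_RE = re.compile(
    r"(?P<file_id>[^/_]+)_(?P<token>[^/_]+)_(?P<instant>\d+)\.parquet$"
)


def missing_base_files(log_text: str) -> list:
    """Return one record per 'was not found' warning, with the commit
    instant that wrote the file and its age when the read failed."""
    records = []
    for warning in MISSING_RE.finditer(log_text):
        name = NAME_RE.search(warning.group("path"))
        if name is None:
            continue  # not a Hudi base file name, skip it
        read_at = datetime.strptime(warning.group("ts"), "%y/%m/%d %H:%M:%S")
        # Keep the first 14 digits so that longer, millisecond-precision
        # instants from newer Hudi versions parse as well.
        written_at = datetime.strptime(name.group("instant")[:14], "%Y%m%d%H%M%S")
        records.append({
            "file_id": name.group("file_id"),
            "instant": name.group("instant"),
            "age_seconds": int((read_at - written_at).total_seconds()),
        })
    return records
```

For the two warnings above, both files belong to instant `20231219071501` and went missing roughly two hours after that commit. Looking in the table's `.hoodie` timeline for a later commit, replacecommit, or clean action that touched the same file IDs would show which process removed them.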
@njalan Were you able to resolve this or understand the root cause? Please let us know. Thanks.