[SUPPORT] Spark planner chooses broadcast hash join for a large HUDI data source
After applying HUDI-6941 to our internal HUDI version (based on 0.14.0), the execution plan frequently selects "broadcast hash join" and broadcasts a large HUDI data source.
I investigated the cause of this issue.
In those cases, Spark uses HadoopFsRelation to read the HUDI source, and Spark's JoinSelection calls HadoopFsRelation#sizeInBytes to estimate the relation size when deciding whether to use a broadcast join. HadoopFsRelation#sizeInBytes in turn calls HoodieFileIndex#sizeInBytes. But at that moment no partitions have been loaded, because Hudi's file-index implementation uses the lazy file-listing mode by default. So HoodieFileIndex#cachedAllInputFileSlices is an empty map, HadoopFsRelation#sizeInBytes returns 0, and the planner produces a suboptimal join plan.
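To illustrate why a reported size of 0 always wins the broadcast check, here is a minimal sketch in plain Python. It is a simplified model of Spark's JoinSelection condition, not Spark's actual code; the default threshold value mirrors `spark.sql.autoBroadcastJoinThreshold` (10 MB).

```python
# Simplified model of Spark's broadcast-join decision (illustrative only):
# a side is considered broadcastable when its estimated size is
# non-negative and at most the autoBroadcastJoinThreshold.
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB default

def can_broadcast(size_in_bytes: int,
                  threshold: int = DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD) -> bool:
    return 0 <= size_in_bytes <= threshold

# With no partitions loaded, HoodieFileIndex reports 0 bytes, so even a
# multi-terabyte table looks broadcastable:
print(can_broadcast(0))           # → True  (size reported as 0)
print(can_broadcast(5 * 2**40))   # → False (the table's real size, 5 TB)
```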
After applying HUDI-6941, more cases enable the lazy listing mode by default, so the issue occurs more frequently.
@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for FileIndex#sizeInBytes to return Long.MAX_VALUE instead of 0 if the FileIndex has not done partition pruning yet?
Update: I found the root cause. The query job does not set spark.sql.extensions to HoodieSparkSessionExtension, so HoodiePruneFileSourcePartitions does not take effect.
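For reference, a sketch of how the extension would be enabled in a PySpark job (a config fragment, not runnable on its own: it requires pyspark and the hudi-spark bundle on the classpath; the `spark.sql.extensions` key and the `org.apache.spark.sql.hudi.HoodieSparkSessionExtension` class name follow the Hudi quickstart docs):

```python
# Sketch: enabling Hudi's session extension so that
# HoodiePruneFileSourcePartitions (and Hudi's other planner rules) take effect.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-query")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```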
BTW, should we return an overestimated size rather than 0 in HoodieFileIndex#sizeInBytes for query jobs that forget to set HoodieSparkSessionExtension, to avoid broadcasting a very large HUDI table, like this patch commit#be9cf?
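To make the proposal concrete, here is a minimal sketch in plain Python of the suggested fallback (the function and parameter names are hypothetical, and this models rather than reproduces the actual HoodieFileIndex#sizeInBytes logic):

```python
LONG_MAX = 2**63 - 1  # Scala's Long.MaxValue

def size_in_bytes(cached_file_slices: dict, partition_pruned: bool) -> int:
    """Hypothetical sketch of the proposed fallback: when partition pruning
    has not run yet (lazy listing, no file slices loaded), return an
    overestimate instead of 0 so the planner never decides to broadcast a
    table whose size is simply unknown."""
    if not partition_pruned and not cached_file_slices:
        # Unknown size must not look broadcastable.
        return LONG_MAX
    # Otherwise, sum the sizes of the file slices that were actually listed.
    return sum(size for slices in cached_file_slices.values() for size in slices)

print(size_in_bytes({}, partition_pruned=False))                 # → 9223372036854775807
print(size_in_bytes({"p1": [1024, 2048]}, partition_pruned=True))  # → 3072
```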