[SPARK-54593][SQL] Expand DPP for LocalRelation and LogicalRDD
What changes were proposed in this pull request?
This PR extends Dynamic Partition Pruning (DPP) support to include LocalRelation and LogicalRDD as selective predicates in the PartitionPruning optimizer rule.
- Modified hasSelectivePredicate() to treat LocalRelation and LogicalRDD as selective predicates
- Modified calculatePlanOverhead() to handle LocalRelation and LogicalRDD with statistics as cached data sources with zero overhead
- Added helper method isLogicalRDDWithStats() to distinguish LogicalRDDs with materialized statistics from those with default estimates
https://issues.apache.org/jira/browse/SPARK-54593
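To make the intent of the three changes easier to follow, here is a minimal toy model in plain Python. It is not the Spark implementation: the `Node` class, the `has_origin_stats` flag, the `size_in_bytes` field, and the `leaves` helper are all hypothetical stand-ins, and the snake_case functions only mirror the names of `hasSelectivePredicate()`, `calculatePlanOverhead()`, and `isLogicalRDDWithStats()`. The sketch also treats every Filter as selective, whereas the real rule may inspect the filter condition more closely.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Node:
    """Toy stand-in for a logical plan node; not Spark's LogicalPlan."""
    kind: str                        # e.g. "Filter", "LocalRelation", "LogicalRDD", "Scan"
    children: Tuple["Node", ...] = ()
    has_origin_stats: bool = False   # hypothetical flag; only meaningful for LogicalRDD
    size_in_bytes: int = 0           # hypothetical size estimate, used for leaf nodes


def leaves(plan: Node) -> Iterator[Node]:
    """Yield every leaf node of the plan tree."""
    if not plan.children:
        yield plan
    for child in plan.children:
        yield from leaves(child)


def is_logical_rdd_with_stats(node: Node) -> bool:
    """True only for a LogicalRDD that carries materialized statistics."""
    return node.kind == "LogicalRDD" and node.has_origin_stats


def has_selective_predicate(plan: Node) -> bool:
    """True if any node in the plan counts as a selective predicate.

    Assumption: Filter, LocalRelation, and LogicalRDD all qualify.
    """
    if plan.kind in ("Filter", "LocalRelation", "LogicalRDD"):
        return True
    return any(has_selective_predicate(child) for child in plan.children)


def calculate_plan_overhead(plan: Node) -> int:
    """Sum the scan cost of all leaves, charging nothing for materialized data."""
    return sum(
        0 if leaf.kind == "LocalRelation" or is_logical_rdd_with_stats(leaf)
        else leaf.size_in_bytes
        for leaf in leaves(plan)
    )


# Hypothetical plans: a VALUES-style relation and two LogicalRDD variants.
values = Node("LocalRelation", size_in_bytes=64)
rdd_with_stats = Node("LogicalRDD", has_origin_stats=True, size_in_bytes=64)
rdd_no_stats = Node("LogicalRDD", size_in_bytes=10**12)  # stands in for a default estimate

print(has_selective_predicate(Node("Project", (values,))))
print(calculate_plan_overhead(values), calculate_plan_overhead(rdd_with_stats))
print(calculate_plan_overhead(rdd_no_stats))
```

In this model a LogicalRDD without statistics still counts as selective but keeps its large default size as overhead. That is one plausible reading of why DPP would not pay off for it, stated here as an assumption rather than a description of the actual cost check.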
Why are the changes needed?
This builds on the previous PR and Jira ticket: https://github.com/apache/spark/pull/53263 and https://issues.apache.org/jira/browse/SPARK-54554
LocalRelation (from VALUES clauses) and LogicalRDD (from checkpoint or createDataFrame with statistics) represent small, materialized datasets that are ideal candidates for DPP optimization. However, the current implementation recognizes only Filter nodes as selective predicates and ignores these node types, so it misses optimization opportunities in broadcast joins.
By enabling DPP for these cases, queries joining partitioned tables with small in-memory datasets can benefit from runtime partition pruning, reducing data scanning and improving query performance.
Does this PR introduce any user-facing change?
No. This is a pure optimizer enhancement. Users may observe improved query performance for joins between partitioned tables and small datasets created via VALUES clauses or checkpoint operations, but there are no API or behavioral changes.
How was this patch tested?
Added 5 comprehensive tests to DynamicPartitionPruningSuite:
- DPP with LocalRelation in broadcast join: Verifies DPP triggers for VALUES clause
- DPP with LogicalRDD from cached DataFrame: Verifies DPP triggers for createDataFrame with RDD
- DPP with empty LocalRelation: Ensures empty datasets don't cause failures
- DPP should not trigger for LogicalRDD without originStats: Negative test verifying LogicalRDD without statistics doesn't trigger DPP
- DPP with large LocalRelation: Verifies DPP works with multiple values
All tests explicitly verify DynamicPruningSubquery appears (or doesn't appear) in the optimized logical plan and use exact result verification with checkAnswer. All existing tests continue to pass.
Was this patch authored or co-authored using generative AI tooling?
No