[SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning
What changes were proposed in this pull request?
When running a Spark application against Spark 3.3, I see the following:
```
java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
```
The CustomPartitioning implementation works fine with Spark 3.2.1.
This PR proposes to relax the check and treat every unrecognized partitioning implementation the same way as UnknownPartitioning, as sketched below.
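The relevant change is the pattern match inside V2ScanPartitioning. A minimal sketch of the idea, assuming the surrounding rule stays as-is (names such as `d`, `funCatalogOpt`, and `sequenceToOption` come from that rule, and the warning text is an assumption rather than the exact merged message):

```scala
// Inside V2ScanPartitioning.apply: map the scan's reported partitioning to a
// catalyst-side Option. Only KeyGroupedPartitioning carries usable keys.
val catalystPartitioning = scan.outputPartitioning() match {
  case kgp: KeyGroupedPartitioning =>
    sequenceToOption(kgp.keys().map(
      V2ExpressionUtils.toCatalystOpt(_, d, funCatalogOpt)))
  case _: UnknownPartitioning => None
  // Previously: case p => throw new IllegalArgumentException(
  //   "Unsupported data source V2 partitioning type: " + p.getClass.getSimpleName)
  case p =>
    // Relaxed behavior: warn and fall back to unknown partitioning.
    logWarning(s"Spark ignores the partitioning ${p.getClass.getSimpleName}; " +
      "falling back to UnknownPartitioning")
    None
}
```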
Why are the changes needed?
A minor release such as 3.3.0 doesn't seem to warrant such a behavioral change from the 3.2.1 release.
Does this PR introduce any user-facing change?
Yes. This allows users' custom partitioning implementations to continue to work with 3.3.x releases.
How was this patch tested?
Existing test suite.
cc @sunchao
@tedyu could you share some background information? If CustomPartitioning is handled the same way as UnknownPartitioning, why can't you just use the latter instead?
If I subclass UnknownPartitioning directly, I would get this compilation error:
```
[error] /nfusr/dev-server/zyu/spark-cassandra-connector/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraScanBuilder.scala:327:92: not enough arguments for constructor UnknownPartitioning: (x$1: Int)org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning.
[error] Unspecified value parameter x$1.
[error] case class CassandraPartitioning(partitionKeys: Array[String], numPartitions: Int) extends UnknownPartitioning {
[error]                                                                                            ^
[error] one error found
```
Can you directly report UnknownPartitioning to Spark?
If a custom partitioning can report UnknownPartitioning to Spark and keep the 3.2.1 behavior, that means the current check is not desirable.
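For context, reporting UnknownPartitioning directly from a scan would look roughly like the sketch below (the class name and stub are illustrative, not the connector's actual code). UnknownPartitioning is a concrete class whose constructor takes the number of partitions, which is why subclassing it without arguments fails above:

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{Partitioning, UnknownPartitioning}
import org.apache.spark.sql.types.StructType

// Return UnknownPartitioning directly instead of subclassing it.
class ExampleScan(schema: StructType, numPartitions: Int)
    extends Scan with SupportsReportPartitioning {
  override def readSchema(): StructType = schema
  override def outputPartitioning(): Partitioning = new UnknownPartitioning(numPartitions)
}
```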
I have run the test using the Cassandra Spark connector against Spark modified with this patch.
The test passes without any modification to the Cassandra Spark connector or client code.
I guess this PR makes sense. @tedyu could you:
- create a Spark JIRA for this issue and update the PR title to reflect it?
- add a warning message too? Clients may expect Spark to use the partitioning they reported and could be surprised that Spark internally ignores it, so a warning message would help them debug.
I think the best solution is for connectors such as the Cassandra connector to adopt the new API; otherwise they could see severe performance penalties.
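For reference, adopting the new API means reporting a KeyGroupedPartitioning built from the connector's partition keys. A hedged sketch (how a given connector derives its keys is an assumption):

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}

// Reporting key-grouped partitioning lets Spark skip shuffles when join or
// aggregation keys line up with the data source's partition keys.
def reportPartitioning(partitionKeys: Array[String], numPartitions: Int): Partitioning = {
  val keys = partitionKeys.map(k => Expressions.identity(k): Expression)
  new KeyGroupedPartitioning(keys, numPartitions)
}
```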
@sunchao Please take another look.
@sunchao https://github.com/tedyu/spark/runs/8459534296 shows that all tests have passed.
@sunchao Can this PR be merged?
Thanks! Merged to master. @tedyu could you create a PR for branch-3.3 as well?
Yeah, good point. A unit test would be nice.