[SPARK-49044][SQL] ValidateExternalType should return child in error
Jira ticket: https://issues.apache.org/jira/browse/SPARK-49044
What changes were proposed in this pull request?
This PR extends the error message produced by checkType in ValidateExternalType with information about the child expression, so the source column of the failure can be identified.
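The shape of the extended message can be sketched as follows. This is a hedged illustration only, not the actual Spark source: the class name ValidateExternalTypeErrorSketch and the method errorMessage are hypothetical, and only the message layout (the child expression appended after "at") mirrors what the PR describes.

```java
// Hypothetical sketch, not Spark internals: shows how the existing
// "{actual} is not a valid external type for schema of {expected}" message
// gains a trailing "at {child}" part that names the source column.
public class ValidateExternalTypeErrorSketch {

    // actual:   runtime class of the external value (e.g. "java.lang.Integer")
    // expected: Catalyst type from the schema (e.g. "double")
    // child:    string form of the child expression, which carries the field name
    static String errorMessage(String actual, String expected, String child) {
        return actual + " is not a valid external type for schema of "
            + expected + " at " + child;
    }

    public static void main(String[] args) {
        System.out.println(errorMessage(
            "java.lang.Integer",
            "double",
            "getexternalrowfield(input[0, org.apache.spark.sql.Row, true], 0, c0)"));
        // → java.lang.Integer is not a valid external type for schema of double
        //   at getexternalrowfield(input[0, org.apache.spark.sql.Row, true], 0, c0)
    }
}
```

The key point is that the child expression string already contains the field ordinal and name (here "0, c0"), which is exactly the context the old message lacked.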
Why are the changes needed?
When we have rows with a mixed schema, the error message "{actual} is not a valid external type for schema of {expected}" does not help identify the column with the problem. I suggest adding information about the source column.
Example
class ErrorMsgSuite extends AnyFunSuite with SharedSparkContext {
  test("shouldThrowSchemaError") {
    val seq: Seq[Row] = Seq(
      Row(toBytes("0"), toBytes(""), 1L),
      Row(toBytes("0"), toBytes(""), 1L)
    )
    val schema: StructType = new StructType()
      .add("f1", BinaryType)
      .add("f3", StringType)
      .add("f2", LongType)
    val df = sqlContext.createDataFrame(sqlContext.sparkContext.parallelize(seq), schema)
    val exception = intercept[RuntimeException] {
      df.show()
    }
    assert(
      exception.getCause.getMessage
        .contains("[B is not a valid external type for schema of string")
    )
    assertResult(
      "[B is not a valid external type for schema of string"
    )(exception.getCause.getMessage)
  }

  def toBytes(x: String): Array[Byte] = x.toCharArray.map(_.toByte)
}
After the fix, the error message may contain extra info:
[B is not a valid external type for schema of string at getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f3)
Example: https://github.com/mrk-andreev/example-spark-schema/blob/main/spark_4.0.0/src/test/scala/ErrorMsgSuite.scala
Does this PR introduce any user-facing change?
Yes. This change extends the error message in the "not a valid external type" case.
Example before
java.lang.Integer is not a valid external type for schema of double
Example after
java.lang.Integer is not a valid external type for schema of double at getexternalrowfield(input[0, org.apache.spark.sql.Row, true], 0, c0)
How was this patch tested?
These changes are covered by the new unit test test("SPARK-49044 ValidateExternalType should return child in error").
Was this patch authored or co-authored using generative AI tooling?
No
When encountering mixed schema rows, the current error message "{actual} is not a valid external type for schema of {expected}" lacks sufficient detail to identify the problematic column. This ambiguity hinders troubleshooting and increases development time.
To enhance error clarity, we propose incorporating the source column name into the error message. For example: "Column 'my_column' has an actual type of {actual} which is not a valid external type for the expected schema of {expected}."
By providing this additional context, developers can more efficiently pinpoint and resolve schema mismatches.
+1, LGTM. Merging to master. Thank you, @mrk-andreev.
@mrk-andreev BTW, could you update the PR description according to the recent changes, please? For instance:
After the fix, the error message may contain extra info:
[B is not a valid external type for schema of string at getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f3)
Thank you, Maxim.
I extended the PR description with the correct error message and also added to "Why are the changes needed?" the short example you provided.
On Tue, 3 Sept 2024 at 07:44, Maxim Gekk wrote the comment quoted above.

-- Best regards, Mark Andreev
@mrk-andreev Congratulations on your first contribution to Apache Spark!