spark [SPARK-49044][SQL] ValidateExternalType should return child in error

Jira ticket: https://issues.apache.org/jira/browse/SPARK-49044

What changes were proposed in this pull request?

In this PR was an extended error message for checkType in ValidateExternalType with information about child nodes.

Why are the changes needed?

When we have mixed schema rows, the error message "{actual} is not a valid external type for schema of {expected}" doesn't help to understand the column with the problem. I suggest adding information about the source column.

Example

class ErrorMsgSuite extends AnyFunSuite with SharedSparkContext {
  test("shouldThrowSchemaError") {
    val seq: Seq[Row] = Seq(
      Row(
        toBytes("0"),
        toBytes(""),
        1L,
      ),
      Row(
        toBytes("0"),
        toBytes(""),
        1L,
      ),
    )    val schema: StructType = new StructType()
      .add("f1", BinaryType)
      .add("f3", StringType)
      .add("f2", LongType)    val df = sqlContext.createDataFrame(sqlContext.sparkContext.parallelize(seq), schema)    val exception = intercept[RuntimeException] {
      df.show()
    }    assert(
      exception.getCause.getMessage
        .contains("[B is not a valid external type for schema of string")
    )
    assertResult(
      "[B is not a valid external type for schema of string"
    )(exception.getCause.getMessage)
  }  def toBytes(x: String): Array[Byte] = x.toCharArray.map(_.toByte)
}

After fix error message may contain extra info

[B is not a valid external type for schema of string at getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f3)

Example: https://github.com/mrk-andreev/example-spark-schema/blob/main/spark_4.0.0/src/test/scala/ErrorMsgSuite.scala

Does this PR introduce any user-facing change?

This changes extends error message in case of "not a valid external type".

Example before

java.lang.Integer is not a valid external type for schema of double

Example after

java.lang.Integer is not a valid external type for schema of double at getexternalrowfield(input[0, org.apache.spark.sql.Row, true], 0, c0)

How was this patch tested?

This changes covered by new unit test test("SPARK-49044 ValidateExternalType should return child in error")

Was this patch authored or co-authored using generative AI tooling?

No

Jul 29 '24 16:07 mrk-andreev

When encountering mixed schema rows, the current error message "{actual} is not a valid external type for schema of {expected}" lacks sufficient detail to identify the problematic column. This ambiguity hinders troubleshooting and increases development time.

To enhance error clarity, we propose incorporating the source column name into the error message. For example: "Column 'my_column' has an actual type of {actual} which is not a valid external type for the expected schema of {expected}."

By providing this additional context, developers can more efficiently pinpoint and resolve schema mismatches.

Aug 21 '24 22:08 michTalebzadeh

+1, LGTM. Merging to master. Thank you, @mrk-andreev.

Sep 03 '24 06:09 MaxGekk

@mrk-andreev BTW, could you update PR's description according the recent changes, please. For instance:

After fix error message may contain extra info

[B is not a valid external type for schema of string at getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f3)

Sep 03 '24 06:09 MaxGekk

Thank you Maxim,

I extended the PR description with the correct error message and also added into "Why are the changes needed?" a short example presented by you.

On Tue, 3 Sept 2024 at 07:44, Maxim Gekk @.***> wrote:

@mrk-andreev https://github.com/mrk-andreev BTW, could you update PR's description according the recent changes, please. For instance:

After fix error message may contain extra info

[B is not a valid external type for schema of string at getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f3)

— Reply to this email directly, view it on GitHub https://github.com/apache/spark/pull/47522#issuecomment-2325722098, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFDNI27ECMWP76MUJGDKKLZUVLEZAVCNFSM6AAAAABLUUBVMSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRVG4ZDEMBZHA . You are receiving this because you were mentioned.Message ID: @.***>

-- Best regards, Mark Andreev

Sep 03 '24 07:09 mrk-andreev

@mrk-andreev Congratulations with your first contribution to Apache Spark!

Sep 03 '24 11:09 MaxGekk