
[SPARK-36663] [FOLLOWUP] [SQL] Support number-only column names in ORC data sources when orc impl is hive

Open mcdull-zhang opened this issue 3 years ago • 2 comments

What changes were proposed in this pull request?

This PR adds support for number-only column names in ORC data sources when the ORC implementation is hive (spark.sql.orc.impl=hive). In the current master, the ORC data source can already write a DataFrame that contains such columns into ORC files.

spark.sql("SELECT 'a' as `1`, 'b' as `2`, 'c' as `3`").write.orc(path)

But reading those ORC files back fails with a ParseException.

val df = spark.read.orc(path)
...
== SQL ==
struct<1:string,2:string,3:string>
-------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:265)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:126)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:40)
	at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$2.applyOrElse(OrcFileOperator.scala:101)

The cause is that CatalystSqlParser.parseDataType fails to parse the schema string when a column name (or a nested field name) consists only of digits.
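
To illustrate the idea, here is a minimal sketch of one way to avoid the parse failure: wrap every field name in backticks before the struct type string is handed to a SQL-style type parser, so that `1` is read as an identifier rather than a number. This is not the code from this PR; the object OrcSchemaQuoting and its methods quote and structString are hypothetical names used only for illustration, and the sketch has no Spark dependency.

```scala
// Hypothetical helper, NOT the implementation from this PR.
// It renders a struct type string with backtick-quoted field names.
object OrcSchemaQuoting {
  // Wrap a name in backticks; an embedded backtick is escaped by doubling it.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // Render (fieldName, typeString) pairs as struct<`name`:type,...>.
  // For a nested struct, pass the result of another structString call
  // as the typeString so that nested field names are quoted as well.
  def structString(fields: Seq[(String, String)]): String =
    fields
      .map { case (name, tpe) => s"${quote(name)}:$tpe" }
      .mkString("struct<", ",", ">")

  def main(args: Array[String]): Unit = {
    val fields = Seq("1" -> "string", "2" -> "string", "3" -> "string")
    // Prints: struct<`1`:string,`2`:string,`3`:string>
    println(structString(fields))
  }
}
```

With the field names quoted this way, the schema string from the error above no longer starts a field with a bare numeric token.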

Why are the changes needed?

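For better usability: ORC files with number-only column names can already be written, so they should also be readable when the ORC implementation is hive.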

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit Tests.

mcdull-zhang, Aug 08 '22 12:08

Can one of the admins verify this patch?

AmplabJenkins, Aug 08 '22 20:08

@cloud-fan please take a look

mcdull-zhang, Aug 09 '22 05:08

@dongjoon-hyun I created a new JIRA, please take a look

mcdull-zhang, Aug 15 '22 03:08

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot], Nov 24 '22 00:11