Unknown data type: int64 when integrated with a Spark project
Hi, I got an exception when data was saved to ClickHouse by Spark SQL with clickhouse-jdbc.
This happens when the column type is Array[Int64]. Spark SQL handles the array type using the following code: spark code
The nested type Int64 of the Array type is converted into its lowercase form int64 and then passed as a parameter to com.clickhouse.jdbc.ClickHouseConnection.createArrayOf. When I read the source code of clickhouse-jdbc, I found that ClickHouseDataType uses the enum names for matching case-sensitive types. So I suggest that when we do the match, we use the upper-case form of all ClickHouse data types for compatibility, and I have raised a PR for that.
Caused by: java.lang.IllegalArgumentException: Unknown data type: int64
    at com.clickhouse.client.ClickHouseDataType.of(ClickHouseDataType.java:231)
    at com.clickhouse.client.ClickHouseColumn.readColumn(ClickHouseColumn.java:361)
    at com.clickhouse.client.ClickHouseColumn.of(ClickHouseColumn.java:407)
    at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:42)
    at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:25)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$15(JdbcUtils.scala:624)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$15$adapted(JdbcUtils.scala:621)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:715)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:890)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
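To make the mismatch easy to reproduce outside of Spark, here is a minimal sketch that calls createArrayOf directly, once with the case-sensitive type name and once with the lower-cased name that Spark produces. It is not taken from the original report; the JDBC URL and values are placeholders.

```scala
import java.sql.DriverManager

// Minimal sketch of the type-name mismatch; the connection URL below is a placeholder.
object LowercaseArrayTypeRepro {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
    try {
      // Works: "Int64" matches the case-sensitive enum name in ClickHouseDataType.
      conn.createArrayOf("Int64", Array[AnyRef](java.lang.Long.valueOf(1L)))

      // Throws java.lang.IllegalArgumentException: Unknown data type: int64,
      // which is the same call Spark makes after lower-casing the element type name.
      conn.createArrayOf("int64", Array[AnyRef](java.lang.Long.valueOf(2L)))
    } finally {
      conn.close()
    }
  }
}
```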
Hi @shaokunW, thanks for reporting the issue and providing a quick fix. Is there any good reason for Spark to enforce nested type names in lower case?
@zhicwu Sorry, I don't have a deep understanding of the Spark framework. Most databases accept both upper and lower case when defining a column in a DDL statement, so I think this conversion is just for formatting purposes.
I have found that ClickHouse works well for some integer aliases, such as int, but not for 'int64'. I have also looked at the data type definitions and aliases from 'select * from system.data_type_families'. I'm confused about the format consistency of the data types.
@shaokunW, ClickHouse supports both case-sensitive and case-insensitive data types. You may get the full list from system.data_type_families. The implementation in Spark has three problems in my view: 1) it enforces case-insensitive (lower-cased) data type names, which might not be suitable for all databases; 2) only part of the nested type name is passed to the JDBC driver, meaning you'll have problems dealing with cases like Array(Nullable(String)), Array(LowCardinality(String)), Array(Tuple(Int8, String)), etc.; and, most importantly, 3) lack of support for Object.
Anyway, I think we can introduce a JDBC-specific option caseInsensitiveTypeName (defaults to false) and move the changes to clickhouse-jdbc. A JdbcDialect should be implemented in the future to support more data types.
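As a quick way to see the distinction mentioned above, here is a small sketch that lists the names ClickHouse itself treats as case-insensitive; aliases like BIGINT or INT appear there, but a lower-cased int64 does not. The connection details are placeholders.

```scala
import java.sql.DriverManager

// Sketch: list the type names ClickHouse treats as case-insensitive; URL is a placeholder.
object ListCaseInsensitiveTypes {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT name, alias_to FROM system.data_type_families " +
          "WHERE case_insensitive = 1 ORDER BY name")
      while (rs.next()) {
        // e.g. BIGINT -> Int64, INT -> Int32; "int64" itself is not case-insensitive.
        println(s"${rs.getString("name")} -> ${rs.getString("alias_to")}")
      }
    } finally {
      conn.close()
    }
  }
}
```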
@zhicwu Agreed. It is not proper to alter the nested type again, since that operation has been delegated to the JdbcDialect implementation. Yes, an option field can work for this, but which data types would be suitable for caseInsensitiveTypeName=true this time? It would be great to have a JdbcDialect implementation for Spark.
Spark's JDBC implementation does not support nested data types well, so I guess it's better to implement a JdbcDialect instead of adding a workaround in the JDBC driver. I think you can take this as an example to register your own dialect (with enhanced ArrayType support).
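A hedged sketch of what such a dialect could look like, shown in Scala; the object name and the type mappings are illustrative and not the exact code from the linked example.

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Illustrative ClickHouse dialect sketch; adjust the mappings to your schema.
object ClickHouseDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:clickhouse")

  // Map Catalyst types to ClickHouse type names, including nested arrays,
  // so DDL generated by Spark uses proper ClickHouse type names.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case LongType         => Some(JdbcType("Int64", Types.BIGINT))
    case IntegerType      => Some(JdbcType("Int32", Types.INTEGER))
    case StringType       => Some(JdbcType("String", Types.VARCHAR))
    case ArrayType(et, _) =>
      getJDBCType(et).map(e => JdbcType(s"Array(${e.databaseTypeDefinition})", Types.ARRAY))
    case _                => None
  }
}

// Register once before writing, e.g. during job setup:
//   JdbcDialects.registerDialect(ClickHouseDialect)
```

Even with a dialect registered, Spark's JdbcUtils still lower-cases the element type name it passes to Connection.createArrayOf for array columns, which is what the next comment runs into.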
Thanks. The example you shared has been integrated into my project to support nested data types like Array. However, when the type is Array, Spark will still call the createArrayOf method of ClickHouseConnection with a lowercase type name, which cannot be avoided on the dialect side.
@zhicwu For the JDBC-specific option caseInsensitiveTypeName, do you mean to add a config option to com.clickhouse.jdbc.JdbcConfig?
Sorry for the late reply. Yes, in JdbcConfig, so it won't impact the Java client. On second thought, it might be better to provide a mapping between ClickHouse native data types and standardized case-insensitive types, so that it's easier for Spark to understand. There's already an option for renaming response columns; I think I can add one more to rename column types as well.
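A sketch of what such a mapping could look like. This is purely illustrative; neither an option name nor this exact table exists in the driver, though each pair below matches an alias ClickHouse already recognizes.

```scala
// Hypothetical mapping from ClickHouse native types to standardized,
// case-insensitive names that frameworks like Spark understand more easily.
val standardizedTypeName: Map[String, String] = Map(
  "Int8"    -> "TINYINT",
  "Int16"   -> "SMALLINT",
  "Int32"   -> "INTEGER",
  "Int64"   -> "BIGINT",
  "Float32" -> "FLOAT",
  "Float64" -> "DOUBLE",
  "String"  -> "VARCHAR"
)
```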
Any updates on this issue?