
Unknown data type: int64 when integrated with spark project

Open shaokunW opened this issue 3 years ago • 9 comments

Hi, I got an exception when saving data to ClickHouse via Spark SQL with clickhouse-jdbc.

This happens when a column's type is Array[Int64]. Spark SQL handles the array type using the following code: spark code

The nested element type Int64 of the Array type is converted to its lowercase form int64 and then passed as a parameter to com.clickhouse.jdbc.ClickHouseConnection.createArrayOf. Reading the source code of clickhouse-jdbc, I found that ClickHouseDataType matches case-sensitive types by their exact enum names. So I suggest that we match all ClickHouse data type names in their upper-case form for compatibility, and I have raised a PR for that (sketched below).
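A minimal sketch of the kind of case-insensitive matching I have in mind, expressed here in Scala for brevity (illustrative only, not the actual clickhouse-java code; note that it would collapse the distinction between case-sensitive names):

```scala
import com.clickhouse.client.ClickHouseDataType

// Index the enum constants by their upper-cased names once.
val byUpperName: Map[String, ClickHouseDataType] =
  ClickHouseDataType.values().map(t => t.name().toUpperCase -> t).toMap

// Match incoming type names case-insensitively.
def of(typeName: String): ClickHouseDataType =
  byUpperName.getOrElse(typeName.toUpperCase,
    throw new IllegalArgumentException(s"Unknown data type: $typeName"))
```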

Caused by: java.lang.IllegalArgumentException: Unknown data type: int64
    at com.clickhouse.client.ClickHouseDataType.of(ClickHouseDataType.java:231)
    at com.clickhouse.client.ClickHouseColumn.readColumn(ClickHouseColumn.java:361)
    at com.clickhouse.client.ClickHouseColumn.of(ClickHouseColumn.java:407)
    at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:42)
    at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:25)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$15(JdbcUtils.scala:624)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$15$adapted(JdbcUtils.scala:621)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:715)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:890)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
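For completeness, the same failure can be triggered directly through the JDBC connection, outside of Spark (a sketch; the URL is a placeholder):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
conn.createArrayOf("Int64", Array[AnyRef](Long.box(1L), Long.box(2L))) // works
conn.createArrayOf("int64", Array[AnyRef](Long.box(1L), Long.box(2L))) // fails: Unknown data type: int64
```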

shaokunW avatar Aug 14 '22 09:08 shaokunW

Hi @shaokunW, thanks for reporting the issue and providing a quick fix. Is there any good reason for Spark to enforce lower-case nested type names?

zhicwu avatar Aug 14 '22 10:08 zhicwu

@zhicwu Sorry, I don't have a deep understanding of the Spark framework. Most databases accept both upper and lower case when defining a column in a DDL statement, so I think this conversion is just for formatting purposes.

I have found that ClickHouse accepts some lower-case integer names, such as int, but not 'int64'. I have also looked at the data type definitions and aliases from select * from system.data_type_families. I'm confused about the case consistency of the data type names.
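For reference, the case-sensitivity of each name can be checked directly against that system table, which has a case_insensitive flag and an alias_to column (a sketch; the connection URL is a placeholder):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
val rs = conn.createStatement().executeQuery(
  "SELECT name, case_insensitive, alias_to FROM system.data_type_families" +
  " WHERE name IN ('Int64', 'INT')")
while (rs.next())
  println(s"${rs.getString("name")} case_insensitive=${rs.getBoolean("case_insensitive")}" +
    s" alias_to=${rs.getString("alias_to")}")
```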

shaokunW avatar Aug 14 '22 13:08 shaokunW

@shaokunW, ClickHouse supports both case-sensitive and case-insensitive data types; you can get the full list from system.data_type_families. The Spark implementation has three problems in my view: 1) it enforces case-insensitive (lower-case) type names, which may not suit all databases; 2) only part of the nested type name is passed to the JDBC driver, so you'll have trouble with cases like Array(Nullable(String)), Array(LowCardinality(String)), Array(Tuple(Int8,String)), etc.; and most importantly 3) it lacks support for Object.
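For what it's worth, when the driver is given the complete type string, it parses nested types fine; the problem is that Spark hands over only a fragment, in the wrong case. A sketch using the ClickHouseColumn.of factory seen in the stack trace (assuming the two-argument form taking a column name and a type string):

```scala
import com.clickhouse.client.ClickHouseColumn

// These all parse successfully when the full type name is passed through:
ClickHouseColumn.of("a", "Array(Nullable(String))")
ClickHouseColumn.of("b", "Array(LowCardinality(String))")
ClickHouseColumn.of("c", "Array(Tuple(Int8,String))")
```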

Anyway, I think we can introduce a JDBC-specific option caseInsensitiveTypeName (defaulting to false) and move the changes into clickhouse-jdbc. A JdbcDialect should be implemented in the future to support more data types.
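Usage could look something like this (purely hypothetical; caseInsensitiveTypeName is only a proposal at this point, not an existing driver option, and the URL is a placeholder):

```scala
import java.sql.DriverManager
import java.util.Properties

val props = new Properties()
// Proposed JdbcConfig option; name and default are tentative.
props.setProperty("caseInsensitiveTypeName", "true")
val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default", props)
```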

zhicwu avatar Aug 15 '22 02:08 zhicwu

@zhicwu Agreed. It is not proper to alter the nested type again, since that operation has been delegated to the JdbcDialect implementation. Yes, an option field can work for this, but which data types should have caseInsensitiveTypeName set to true? It would be great to have a JdbcDialect implementation for Spark.

shaokunW avatar Aug 16 '22 13:08 shaokunW

Spark's JDBC implementation does not support nested data types well, so I think it's better to implement a JdbcDialect rather than adding a workaround to the JDBC driver. You can take this as an example for registering your own dialect (with enhanced ArrayType support).
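A minimal sketch of what such a dialect could look like (illustrative only; the URL check and type mappings would need to be fleshed out for a real project):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object ClickHouseDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:clickhouse")

  // Map Catalyst types to ClickHouse type names, including array element types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case LongType               => Some(JdbcType("Int64", Types.BIGINT))
    case ArrayType(LongType, _) => Some(JdbcType("Array(Int64)", Types.ARRAY))
    case _                      => None
  }
}

// Register before running the Spark job that writes to ClickHouse.
JdbcDialects.registerDialect(ClickHouseDialect)
```

Note, though, that a dialect alone may not fix the array case, since Spark's JdbcUtils lower-cases the element type name it passes to createArrayOf, which is the behaviour reported above.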

zhicwu avatar Aug 16 '22 13:08 zhicwu

Thanks. I have integrated the example you shared into my project to support nested data types like Array. However, when the type is Array, Spark still calls the createArrayOf method of ClickHouseConnection with a lower-case type name, which the dialect cannot avoid.

shaokunW avatar Aug 16 '22 14:08 shaokunW

@zhicwu For the JDBC-specific option caseInsensitiveTypeName, do you mean to add a config option to com.clickhouse.jdbc.JdbcConfig?

shaokunW avatar Aug 17 '22 07:08 shaokunW

> @zhicwu For the JDBC-specific option caseInsensitiveTypeName, do you mean to add a config option to com.clickhouse.jdbc.JdbcConfig?

Sorry for the late reply. Yes, in JdbcConfig, so it won't impact the Java client. On second thought, it might be better to provide a mapping between ClickHouse native data types and standardized case-insensitive types, so that it's easier for Spark to understand. There's an option for renaming response columns; I think I can add one more to rename column types as well.

zhicwu avatar Aug 19 '22 00:08 zhicwu

Any updates on this issue?

f1llon avatar Jan 23 '24 18:01 f1llon