
Incoming batch schema is not compatible with the table's one

njalan opened this issue · 13 comments

I got the exception below when ingesting data from SQL Server into Hudi:

```
org.apache.hudi.exception.SchemaCompatibilityException: Incoming batch schema is not compatible with the table's one
	at org.apache.hudi.HoodieSparkSqlWriter$.deduceWriterSchema(HoodieSparkSqlWriter.scala:496)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:314)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
```

The source table DDL is:

```sql
-- auto-generated definition
create table Address
(
    Id         int identity constraint [xxxx] primary key,
    Line1      nvarchar(128),
    Line2      nvarchar(128),
    ccode      nvarchar(2) not null constraint [xxxx] references Country,
    XEID       int not null,
    cbUser     nvarchar(48),
    MuUser     int not null,
    MyUser     nvarchar(48),
    CreateDate datetime not null,
    Latitude   decimal(12, 9),
    Longitude  decimal(12, 9)
)
```

Environment Description

Hudi version : 0.9

Spark version : 3.0.1

Hive version : 3.1

Hadoop version : 3.2.2

Storage (HDFS/S3/GCS..) :

Running on Docker? : no

njalan avatar Nov 03 '23 08:11 njalan

@njalan This happens when the source schema is not backward compatible with the Hudi table schema. Can you give us more insight into what schema changes you are seeing?

ad1happy2go avatar Nov 03 '23 14:11 ad1happy2go

@ad1happy2go Why is there an "Incoming schema (canonicalized)"? And why is the Table's schema not the source table's schema, when only the incoming schema matches the source table? Below are the schema details.

Incoming schema:

```json
{
  "type" : "record",
  "name" : "address_record",
  "namespace" : "hoodie.address",
  "fields" : [
    { "name" : "Id", "type" : [ "null", "int" ], "default" : null },
    { "name" : "Line1", "type" : [ "null", "string" ], "default" : null },
    { "name" : "Line2", "type" : [ "null", "string" ], "default" : null },
    { "name" : "City", "type" : [ "null", "string" ], "default" : null },
    { "name" : "State", "type" : [ "null", "string" ], "default" : null },
    { "name" : "Zip", "type" : [ "null", "string" ], "default" : null },
    { "name" : "CountryCode", "type" : [ "null", "string" ], "default" : null },
    { "name" : "CreateByUserID", "type" : [ "null", "int" ], "default" : null },
    { "name" : "CreateByUser", "type" : [ "null", "string" ], "default" : null },
    { "name" : "ModifyByUserID", "type" : [ "null", "int" ], "default" : null },
    { "name" : "ModifyByUser", "type" : [ "null", "string" ], "default" : null },
    { "name" : "CreateDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
    { "name" : "ModifyDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
    { "name" : "Latitude", "type" : [ "null", { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.address.address_record.Latitude", "size" : 6, "logicalType" : "decimal", "precision" : 12, "scale" : 9 } ], "default" : null },
    { "name" : "Longitude", "type" : [ "null", { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.address.address_record.Longitude", "size" : 6, "logicalType" : "decimal", "precision" : 12, "scale" : 9 } ], "default" : null }
  ]
}
```

The "Incoming schema (canonicalized)" printed in the log is byte-for-byte identical to the incoming schema above, so it is omitted here.

Table's schema:

```json
{
  "type" : "record",
  "name" : "address_record",
  "namespace" : "hoodie.address",
  "fields" : [
    { "name" : "addressid", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "addressline1", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "addressline2", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "countrycode", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "admin1code", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "admin2code", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "admin3code", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "postalcode", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "transactiondate", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "createdby", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "createdate", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "modifydate", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null },
    { "name" : "modifiedby", "type" : [ "null", "string" ], "doc" : "from deserializer", "default" : null }
  ]
}
```

njalan avatar Nov 04 '23 13:11 njalan

@njalan I can see the table schema is completely different from the incoming schema. The canonicalized schema is identical to the incoming schema.

Is your incoming schema supposed to differ from the table schema? You may need to transform the schema before upserting to Hudi.

ad1happy2go avatar Nov 04 '23 17:11 ad1happy2go
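To see why the writer rejects this batch, a minimal sketch (plain Python, no Hudi dependency, not Hudi's actual `deduceWriterSchema` code) of the kind of comparison involved: checking whether the incoming batch's Avro field names can be reconciled with the table's. The schemas below are abbreviated from the ones posted above.

```python
# Abbreviated incoming schema (from the source SQL Server table).
incoming = {
    "type": "record", "name": "address_record",
    "fields": [
        {"name": "Id", "type": ["null", "int"], "default": None},
        {"name": "Line1", "type": ["null", "string"], "default": None},
        {"name": "CountryCode", "type": ["null", "string"], "default": None},
    ],
}

# Abbreviated table schema (the one Hudi reports, with unrelated field names).
table = {
    "type": "record", "name": "address_record",
    "fields": [
        {"name": "addressid", "type": ["null", "string"], "default": None},
        {"name": "addressline1", "type": ["null", "string"], "default": None},
        {"name": "countrycode", "type": ["null", "string"], "default": None},
    ],
}

def field_names(schema):
    # Compare case-insensitively, since Hive lowercases column names.
    return {f["name"].lower() for f in schema["fields"]}

common = field_names(incoming) & field_names(table)
missing_in_table = field_names(incoming) - field_names(table)
print(sorted(common))            # -> ['countrycode']
print(sorted(missing_in_table))  # -> ['id', 'line1']
```

With almost no field names in common, no amount of schema evolution or projection can reconcile the two records, which is exactly the `SchemaCompatibilityException` case.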

@ad1happy2go I removed the Hudi table and also deleted all the files, but I still get the same error message. However, if I rename the target table from address to address_1, the Spark job runs successfully.

njalan avatar Nov 05 '23 11:11 njalan

@njalan That suggests the old data under the name 'address' was not deleted properly. Can you confirm?

ad1happy2go avatar Nov 06 '23 05:11 ad1happy2go

@ad1happy2go I am sure I removed all the data files; I tested many times. It is weird how this table's schema was generated, since it is totally different from the source table.

njalan avatar Nov 06 '23 07:11 njalan

@njalan If it works when you change your table name to address_1, then there must be some residue from an old run of address. If you are able to reproduce this with a sample scenario, let us know. Thanks.

ad1happy2go avatar Nov 06 '23 17:11 ad1happy2go

@ad1happy2go I have another table named address in a different schema with the same issue. It works fine with Hudi 0.9, but it does not work with Hudi 0.13. I think there is a bug in Hudi 0.13.

njalan avatar Nov 07 '23 02:11 njalan

@njalan Interesting, thanks for all the effort, although I can't think of a reason for it. It would be really helpful if you could provide some dummy data or sample code that I can use to reproduce it.

ad1happy2go avatar Nov 07 '23 13:11 ad1happy2go

@ad1happy2go I debugged the source code and found the reason. My target table is testing.address and I did not set hoodie.database.name, but there is an existing table default.address. In that case Hudi picks up the default.address schema as the table's schema, even though they are two totally different tables. Do I need to set hoodie.database.name for every table? Why not use the sync database name as the Hudi database name? This works fine in Hudi 0.9. Is this a bug, or can I raise a PR to use the sync database name as the Hudi database name when it is not specified?

njalan avatar Dec 11 '23 11:12 njalan
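Given the root cause above, the workaround is to pin the Hudi database explicitly so that testing.address is never resolved as default.address. A hedged sketch of the relevant writer options follows; `hoodie.table.name`, `hoodie.database.name`, and the `hive_sync` keys are real Hudi configs, while the database/table names and path are placeholders.

```python
# Placeholder names: "testing"/"address" stand in for the real database/table.
hudi_options = {
    "hoodie.table.name": "address",
    "hoodie.database.name": "testing",  # set explicitly to avoid the implicit "default"
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "testing",
    "hoodie.datasource.hive_sync.table": "address",
}

# With a Spark DataFrame `df` (not constructed here), the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/data/hudi/testing/address")
print(hudi_options["hoodie.database.name"])  # -> testing
```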

@njalan Great, nice catch! Actually it goes the other way around: it sets the sync database name if it's not set. I guess we can raise that PR to fix it. Thanks for the contribution.

ad1happy2go avatar Dec 11 '23 12:12 ad1happy2go
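The fallback being discussed could be sketched as a small name-resolution function. This is purely illustrative and not Hudi's actual internals; `resolve_database` is a hypothetical helper, and only the two config keys it reads are real Hudi option names.

```python
DEFAULT_DATABASE = "default"

def resolve_database(options: dict) -> str:
    # Illustrative fallback order (the behavior njalan is proposing):
    # explicit hoodie.database.name, else the Hive sync database, else "default".
    return (
        options.get("hoodie.database.name")
        or options.get("hoodie.datasource.hive_sync.database")
        or DEFAULT_DATABASE
    )

print(resolve_database({"hoodie.database.name": "testing"}))                  # -> testing
print(resolve_database({"hoodie.datasource.hive_sync.database": "testing"}))  # -> testing
print(resolve_database({}))                                                   # -> default
```

With this ordering, a pipeline that only configures Hive sync would still land on testing.address rather than silently colliding with default.address.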

@ad1happy2go I just raised PR https://github.com/apache/hudi/pull/10308. Could you please kindly review it? It is my first PR for Hudi, and I am not sure whether it can be merged.

njalan avatar Dec 12 '23 05:12 njalan

I checked; @danny is following up on this.

ad1happy2go avatar Dec 19 '23 16:12 ad1happy2go