node-mysql2 icon indicating copy to clipboard operation
node-mysql2 copied to clipboard

There might be a conversion from UTF to ASCII behind the scenes that generates incorrect values

Open maximedupre opened this issue 7 years ago • 6 comments

As explained in this ticket, I'm trying to insert a row with the string "kické-accent.wav".

My table uses the "latin1_swedish_ci" collation and that seems juste fine, as I am able to insert the row without a problem using the MySQL Workbench.

However, when attempting to create the row with mysql2, I get the following error:

Error: Incorrect string value: '\xCC\x81-acc...' for column 'name' at row 1
at PromiseConnection.query (.../node_modules/mysql2/promise.js:92:22)

The code that inserts the row looks like this:

await db.query('INSERT INTO files SET ?', [{ extension: 'wav', name: 'kické-accent.wav', user_id: 3706 }]);

What is going on under the hood that prevents me from inserting a row just like I would do manually using the MySQL Workbench?

Cheers!

maximedupre avatar Apr 11 '19 23:04 maximedupre

This could be actually an issue with sqlstring module doing escape in input. While I'm checking this, could you try to do query using .execute() ? ( everything else being the same )

sidorares avatar Apr 12 '19 00:04 sidorares

can you show files schema? Trying to prepare fully self contained example to reproduce the issue

sidorares avatar Apr 12 '19 00:04 sidorares

Hey, thanks for helping. I tried the query with .execute() and the error is the same (I had to change the syntax a bit, because SET ? doesn't work with .execute(), you need to unpack the object values like: SET column1=?, column2=?, ...).

files schema:

CREATE TABLE `files` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`project_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`name` varchar(200) NOT NULL,
`extension` varchar(50) NOT NULL,
`size` int(10) unsigned NOT NULL,
`creation_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `files_project_id_foreign` (`project_id`),
CONSTRAINT `files_project_id_foreign` FOREIGN KEY (`project_id`) REFERENCES `projects` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1

EDIT

Here is the specific error:

{
	"message": "Incorrect string value: '\\xCC\\x81-acc...' for column 'name' at row 1",
	"code": "ER_TRUNCATED_WRONG_VALUE_FOR_FIELD",
	"errno": 1366,
	"sqlState": "HY000",
	"sqlMessage": "Incorrect string value: '\\xCC\\x81-acc...' for column 'name' at row 1"
}

maximedupre avatar Apr 12 '19 01:04 maximedupre

@maximedupre add charset: 'LATIN1_SWEDISH_CI' to connection options

This is what I think happens: Default connection encoding server decides to use is 'utf8' ( which in mysql not a real UTF-8 but something that was invented before utf8 was standartised ). "é" is surrogate pair on JS side which I believe mysql "utf8" can't understand

> "é".codePointAt(0)
101
> "é".codePointAt(1)
769
>

( might be wrong in explanation, correct me if you have better one )

Also charset: 'UTF8MB4_UNICODE_CI' should work as well ( UTF8MB4 is "real" utf-8 while "UTF8" in mysql is cesu-8 - https://en.wikipedia.org/wiki/CESU-8 ). I'm sure your Workbench connects using UTF8MB4_UNICODE_CI

sidorares avatar Apr 12 '19 05:04 sidorares

I tried with charset: 'LATIN1_SWEDISH_CI' and there is no more error, but the string is saved without accent in the DB ("kicke -accent.wav" instead of "kické-accent.wav").

charset: 'UTF8MB4_UNICODE_CI' causes the same error as when no charset is specified in the connection options.

EDIT

Btw, I have a different result when checking the code point for the "é":

> "é".codePointAt(0);
233
> "é".codePointAt(1);
undefined

233 actually maps to the "é" in the latin1 charset and the unicode charset

EDIT 2

Oh wow, apparently there's a difference between "é" and "é" (try them in your node REPL). The same letter but encoded differently? Now that's confusing.

EDIT 3 Made it work with charset: 'UTF8MB4' as opposed to charset: 'UTF8MB4_UNICODE_CI'. You might wanna add a warning or error message when passing in an invalid charset. Right now, nothing happens, which can lead to a lot of confusion as you can see ^^. I'll leave this ticket open for this reason, but close it if you wish to handle the warnings separately.

EDIT 4 Actually, using the charset: 'UTF8MB4' connection option doesn't solve the problem, unless you create the schema using the connection, in which case the table will have a UTF8MB4_GENERAL_CI charset by default instead of a LATIN1_SWEDISH_CI charset

EDIT 5 Ended up converting the collation and charset of my DB and all it's tables to utf8mb4_unicode_ci and now everything works fine. UTF8 is a better, modern choice anyway according to all the research I did.

maximedupre avatar Apr 12 '19 13:04 maximedupre

We also encounter this problem. Everything, the whole DB, all tables and all fields are utf8mb4 / utf8mb4_unicode_ci. What we do is write an umlaut in one table (works), then later download it and insert it into another table, which results in getting the same error. We added tons of fallbacks in the code to ensure the encoding is fine, but it still keeps occuring. Any clue how/why/what is going on?

zentek-markus avatar Oct 21 '25 10:10 zentek-markus