backdrop-issues icon indicating copy to clipboard operation
backdrop-issues copied to clipboard

Retroactive transliteration of file names broken with MySQL 8.0.22+

Open jlfranklin opened this issue 4 years ago • 39 comments

Description of the bug

One test fails under Ubuntu 20.04 that does not fail under Debian 9 (php 7.0, MariaDB 10.1.47).

It may be a configuration or database issue. Database is using default char set 'utf8', not utf8mb4, and utf8mb4 is not enabled in settings.php.

Steps To Reproduce

To reproduce the behavior:

  1. Install Backdrop in an Ubuntu 20.04 server with MySQL 8.0, PHP 7.4 FPM, and Apache 2.4.41)
  2. Run the full suite of tests.

Actual behavior

---- File API: File uploading transliteration (FileUploadTransliterationTest) ----


Status    Group      Filename          Line Function                            
--------------------------------------------------------------------------------
Exception PDOExcepti system.admin.inc  2679 system_transliteration_retroactive(
    SQLSTATE[HY000]: General error: 3995 Character set 'utf8_general_ci' cannot
    be used in conjunction with 'binary' in call to regexp_like.: SELECT
    COUNT(*) AS expression
    FROM 
    (SELECT 1 AS expression
    FROM 
    {file_managed} file_managed
    WHERE  (uri NOT REGEXP BINARY :db_condition_placeholder_0) ) subquery; Array
    (
        [:db_condition_placeholder_0] => /[a-z0-9_.-]+$
    )
Fail      Other      file.test         1073 FileUploadTransliterationTest->test
    "File transliteration is not yet enabled." found
Exception PDOExcepti system.admin.inc  2679 system_transliteration_retroactive(
    SQLSTATE[HY000]: General error: 3995 Character set 'utf8_general_ci' cannot
    be used in conjunction with 'binary' in call to regexp_like.: SELECT
    COUNT(*) AS expression
    FROM 
    (SELECT 1 AS expression
    FROM 
    {file_managed} file_managed
    WHERE  (uri NOT REGEXP BINARY :db_condition_placeholder_0) ) subquery; Array
    (
        [:db_condition_placeholder_0] => /[a-z0-9_.-]+$
    )
Fail      Other      file.test         1103 FileUploadTransliterationTest->test
    Two files available for transliteration.
Fail      Other      file.test         1104 FileUploadTransliterationTest->test
    Found the Transliterate button
Fail      Other      file.test         1104 FileUploadTransliterationTest->test
    Found the requested form fields at
Fail      Other      file.test         1105 FileUploadTransliterationTest->test
    "2 file names have been successfully transliterated." found
Fail      Other      file.test         1111 FileUploadTransliterationTest->test
    File name transliterated and lower case after bulk updating.

Additional information

Add any other information that could help, such as:

  • Backdrop 1.18.1
  • Apache 2.4.41, PHP 7.4.3
  • Ubuntu 20.04

jlfranklin avatar Feb 03 '21 05:02 jlfranklin

@jlfranklin Many thanks for reporting this issue.

My suspicion is that it has something to do with some stricter query handling in MySQL 8 (and eventually recent MariaDB). I don't think it's related to utf8-general vs. utf8-mb4.

The query looks a bit strange anyway...

SELECT
    COUNT(*) AS expression
    FROM 
    (SELECT 1 AS expression
    FROM 
    {file_managed} file_managed
    WHERE  (uri NOT REGEXP BINARY :db_condition_placeholder_0) )

I mean the NOT REGEXP BINARY part. UPDATE: which turned out to be totally OK, according to the MySQL docs, at least in select queries, but maybe not anymore in subqueries?

First of all: I suspect, not only the test, but also the functionality is broken. I can't actually test locally, as I don't have MySQL 8 here on my dev.

But you can. Steps to verify:

  • Go to admin/config/media/file-system
  • Keep "Transliterate file names during upload" turned on, but
  • turn off "Lowercase transliterated file names"
  • Create a file with a name like "mYfAnCyFiLe.txt" and upload
  • Back to admin/config/media/file-system
  • Now turn on "Lowercase transliterated file names"
  • Go to admin/config/media/file-system/transliteration (retroactive transliteration)

Is the file with camel-case name listed there? (I doubt it.)

indigoxela avatar Feb 04 '21 12:02 indigoxela

Although I'm currently not able to reproduce (no MySQL 8 here), I created a PR with a slightly different syntax.

@jlfranklin does that patch change anything for your Ubuntu setup?

indigoxela avatar Feb 04 '21 14:02 indigoxela

Ah, there it is: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-22.html

" Regular expression functions such as REGEXP_LIKE() yielded inconsistent results with binary string arguments. These functions now reject binary strings with an error. (Bug #31031886, Bug #98951, Bug #31031888, Bug #98950) "

A solution for both, MySQL 5 and 8 will be tricky. The older does not support the case sensitive switch within the regex (?-i), the newer refuses to compare to BINARY.

But maybe CAST or CONVERT could be used to get a case sensitive charset?

@jlfranklin can you try the latest variant from my PR, please?

indigoxela avatar Feb 04 '21 14:02 indigoxela

I'm currently not able to reproduce (no MySQL 8 here)

I believe I can use MySQL 8 with lando on my local. I'll try this soon as I can get to it.

klonos avatar Feb 04 '21 23:02 klonos

The PR passes on MySQL 8. (Also passes on PostgreSQL, FWIW. SQLite never passed that test, so I have no expectation of it working with this PR.)

jlfranklin avatar Feb 05 '21 00:02 jlfranklin

The PR passes on MySQL 8. (Also passes on PostgreSQL...

@jlfranklin good to know. I was a bit concerned regarding PostgreSQL.

I believe I can use MySQL 8 with lando on my local. I'll try this soon as I can get to it.

@klonos cool, many thanks! We still need to verify my suspicion that not only the test, but also the functionality is currently broken with MySQL 8 and gets fixed by the PR.

indigoxela avatar Feb 05 '21 06:02 indigoxela

OK, I've tested this:

  • provisioned a new lando instance with:
    • vanilla Backdrop 1.x (1.19.x-dev)
    • database: mysql:8.0 (which resulted in 8.0.19)
  • tested using the steps provided by @indigoxela ->

Is the file with camel-case name listed there? (I doubt it.)

Yes it was 😅

image

klonos avatar Feb 05 '21 10:02 klonos

...transliteration works w/o any error btw:

image

...and the file does actually get transliterated to myfancyfile.txt

klonos avatar Feb 05 '21 10:02 klonos

database: mysql:8.0 (which resulted in 8.0.19)

Many thanks for testing. :+1: Unfortunately... according to their changelog the breaking change came with MySQL 8.0.22.

@klonos I suppose, lando doesn't ship with that version yet?

indigoxela avatar Feb 05 '21 10:02 indigoxela

They say that they do not officially support specifying patch versions, but it is possible: https://docs.lando.dev/config/mysql.html#supported-versions

Gimme a few mins to try that...

klonos avatar Feb 05 '21 10:02 klonos

...nope. That didn't work 😅 ...I need to rest, and will return to it later today/tonight.

klonos avatar Feb 05 '21 11:02 klonos

...nope. That didn't work ...I need to rest, and will return to it later today/tonight.

Hey, no need to rush!

@jlfranklin should be able to verify, as he obviously has an affected version.

(The version seems to be 8.0.23-0ubuntu0.20.04.1.)

indigoxela avatar Feb 05 '21 11:02 indigoxela

Is the file with camel-case name listed there? (I doubt it.)

You called it. The list is empty on my Ubuntu 20.04 VM:

For reference:

  • Ubuntu 20.04.2 LTS
  • MySQL version 8.0.23-0ubuntu0.20.04.1.
  • PHP 7.4.3-4ubuntu2.4
  • Apache 2.4.41-4ubuntu3.1
  • /etc/mysql/mysql.conf.d/99-silkscreen.cnf

jlfranklin avatar Feb 05 '21 13:02 jlfranklin

With the patch from the PR, the list is empty.

Without the patch, the transliteration page returns this:

image

jlfranklin avatar Feb 05 '21 14:02 jlfranklin

@jlfranklin many thanks for verifying! I'll need to get a dev setup with MySQL 8.0.22+. That's not difficult, just a little time consuming - I need to setup an extra VM for that.

Yes, that Exception is the same as in the failing test "...cannot be used in conjunction with "binary" in call to regexp_like...".

indigoxela avatar Feb 05 '21 14:02 indigoxela

@jlfranklin I just tested the exact same steps as in my comment with my PR branch on a VM with following setup:

  • Backdrop CMS 1.19.x-dev (branch 4931-file-translit-mysql8)
  • MySQL 8.0.23-1debian10
  • Apache/2.4.38 (Debian)
  • PHP 7.3.19 (fpm)

And it works perfectly. The camel-case named file is available on admin/config/media/file-system/transliteration. And retroactive transliteration is fully functional.

I suspect, you missed something important in the steps. The Exception without the PR is expected, but the PR fixes it and also the case sensitive query.

The VM is still around, I can grant access to it (admin login) if requested.

indigoxela avatar Feb 06 '21 12:02 indigoxela

Lazy hand-patching on my part. My bad.

I just re-did the patch by downloading the diff itself, and it works, at least as far as showing the mixed-case filename. However, it still shows the all-lowercase filename after retroactive transliteration, and each additional transliteration run adds another _0 to the filename.

Enabling devel and turning on the query log, I see the query in question translates to:

SELECT COUNT(*) AS `expression` FROM (SELECT 1 AS `expression` FROM `file_managed` file_managed WHERE (`uri` NOT REGEXP :db_condition_placeholder_0) ) subquery

Running it by hand with the CAST placeholder returns an error if not quoted, and 1 if quoted.

mysql> SELECT COUNT(*) AS expression FROM (SELECT 1 AS expression FROM `file_managed` file_managed WHERE (uri NOT REGEXP CAST("/[a-z0-9_.-]+$" AS CHAR CHARACTER SET latin1_general_cs) )) subquery;
ERROR 1115 (42000): Unknown character set: 'latin1_general_cs'
mysql> SELECT COUNT(*) AS expression FROM (SELECT 1 AS expression FROM `file_managed` file_managed WHERE (uri NOT REGEXP 'CAST("/[a-z0-9_.-]+$" AS CHAR CHARACTER SET latin1_general_cs)' )) subquery;
+------------+
| expression |
+------------+
|          1 |
+------------+
1 row in set (0.01 sec)

mysql> select uri from file_managed;
+--------------------------+
| uri                      |
+--------------------------+
| public://myfancyfile.txt |
+--------------------------+
1 row in set (0.00 sec)

jlfranklin avatar Feb 06 '21 18:02 jlfranklin

However, it still shows the all-lowercase filename after retroactive transliteration

Confirmed. The CAST approach seems a little too dirty, anyway. Next approach: version comparison. My PR has been updated.

Successfully tested with MySQL 8.0.23 and MariaDB 10.3.27 - needs more testing.

BTW: if anyone knows a smarter solution, please post here. :wink:

indigoxela avatar Feb 08 '21 07:02 indigoxela

Some more info about the flag now used:

MariaDB calls these flags "(default)_regex_flags" and they have been introduce in version 10.0.11

MySQL calls these flags "match_type" and it's totally unclear, when they have been introduced. Probably in 8.0.x.

indigoxela avatar Feb 08 '21 07:02 indigoxela

It seems that @jlfranklin needs explicit annotation to get a notification mail. :smiley:

Does the latest approach (version check) work for you?

indigoxela avatar Feb 10 '21 12:02 indigoxela

It seems that @jlfranklin needs explicit annotation to get a notification mail. smiley

A slower week at work would help, too.

Does the latest approach (version check) work for you?

It passes the tests against MySQL 5.5, MySQL 8, and PostgreSQL.9.6. As for the patch itself, I appreciate checking the driver is 'mysql' before checking the version. That makes it easier to apply to Silkscreen.

That kind of conditional is something I think belongs in the core/database/mysql code, but I don't see a way to do it without creating a new API at the database layer. This seems to be the only place in core we use a REGEXP condition. (I'm a little surprised it's not used in search.)

What do you about adding some regexpCondition() functions to the API?

jlfranklin avatar Feb 12 '21 06:02 jlfranklin

A slower week at work would help, too.

I see... :smile:

I appreciate checking the driver is 'mysql' before checking the version. That makes it easier to apply to Silkscreen.

That was actually the intention behind it.

This seems to be the only place in core we use a REGEXP condition. ... adding some regexpCondition() functions to the API?

This is the only place, that's also my finding. Enhancing the API... - I'm not sure if it's worth the effort for a single query.

indigoxela avatar Feb 12 '21 07:02 indigoxela

This is the only place, that's also my finding. Enhancing the API... - I'm not sure if it's worth the effort for a single query.

I'm on the fence myself, leaning on the side of enhancing to keep this code clean. It's part of the SQL spec, and it takes some extra flags that are hard at best to express in a DB-agnostic way with condition(), short of pre-parsing the regex to extract the flags.

I'm OK with doing it as a follow-up issue. I'm also interested in hearing what the @backdrop/core-committers and PMC think.

jlfranklin avatar Feb 12 '21 19:02 jlfranklin

Let's not block ourselves waiting for a Core Committer or PMC member to comment here. A member of PMC set the milestone, and I'm optimistic enough to interpret that as approval. :wink: At least no objection.

A (potential) db layer enhancement can be a follow-up, anyway. And the code will be reviewed before merging.

@jlfranklin as you opened the issue, you're the perfect candidate to verify that the PR is working properly - not only the functional test, but also the functionality. If so, please set the label "works for me". Then we can proceed with the review process.

indigoxela avatar Feb 20 '21 11:02 indigoxela

I think the code looks fine. Doesn't look like it's stopping us from doing future abstraction if necessary.

By the way, I was confused about the mention of testing with Postgres and Sqlite. Backdrop officially only supports MySQL (and MariaDB) so it's officially ok if it breaks other dbs.

herbdool avatar Dec 31 '21 17:12 herbdool

I can finally test this with Lando now!...

For anyone else that wants to know how, please see: https://github.com/lando/lando/issues/2948#issuecomment-1003484299

klonos avatar Jan 01 '22 02:01 klonos

I've installed a vanilla Backdrop instance, using:

  • latest 1.x
  • php 7.4.26
  • apache 2.4.51
  • mysql 8.0.22

I then tried the steps to reproduce provided by @indigoxela, but when I navigated to admin/config/media/file-system/transliteration (retroactive transliteration) I got the same error as @jlfranklin in https://github.com/backdrop/backdrop-issues/issues/4931#issuecomment-774064812:

PDOException: SQLSTATE[HY000]: General error: 3995 Character set 'utf8mb4_0900_ai_ci'
cannot be used in conjunction with 'binary' in call to regexp_like.:
SELECT COUNT(*) AS `expression` FROM (SELECT 1 AS `expression` FROM {file_managed} file_managed
WHERE (`uri` NOT REGEXP BINARY :db_condition_placeholder_0) ) subquery;
Array ( [:db_condition_placeholder_0] => /[a-z0-9_.-]+$ ) in system_transliteration_retroactive()
(line 2683 of /app/docroot/core/modules/system/system.admin.inc).

I then destroyed the instance, and reinstalled with the changes in the PR. When I tried the steps to reproduce, I did not hit that PDOException, but the list in admin/config/media/file-system/transliteration was empty:

image

I then destroyed the instance once again, and installed again, this time with mysql 8.0.23 + the changes from the PR. This time when visiting admin/config/media/file-system/transliteration the file was listed there. Transliteration worked as expected:

image

[edit] please ignore my previous report re upload errors after deleting and trying to re-upload the transliterated file. This seems to be a pre-existing issue, and mot specific to mysql: #5424.

So to summarize:

  • with mysql 8.0.22 and vanilla Backdrop 1.x -> PDOException 👎🏼
  • with mysql 8.0.22 and the changes from the PR applied -> no PDOException 👍🏼 but empty list of to-be-transliterated files 👎🏼
  • with mysql 8.0.23+ (or mariadb) and the changes from the PR applied -> no PDOException 👍🏼 and the file listed in the to-be-transliterated list 👍🏼

As such, I'm setting this back to "needs work". If the empty list in admin/config/media/file-system/transliteration with mysql v8.0.22 is expected (due to the bug in mysql), then please feel free to set this back to RTBC.

klonos avatar Jan 01 '22 03:01 klonos

As such, I'm setting this back to "needs work".

I'm a bit confused. @klonos you already realized that the problem you encountered is unrelated to the problem we try to fix here. And still you reject this fix? Mind to explain, why? Or am I missing the point?

Anyway, rebased. I didn't change the lables and leave it to someone else to decide.

indigoxela avatar Jan 01 '22 10:01 indigoxela

Sorry for the confusion @indigoxela ...I've set it back to "needs work" because of this:

  • with mysql 8.0.22 and the changes from the PR applied -> no PDOException 👍🏼 but empty list of to-be-transliterated files 👎🏼

...but as I explained:

If the empty list in admin/config/media/file-system/transliteration with mysql v8.0.22 is expected (due to the bug in mysql), then please feel free to set this back to RTBC.

klonos avatar Jan 01 '22 16:01 klonos

I tried to switch the GHA workflow to the exact version (8.0.22), but that's not supported. The currently installed 8 version with that action is 8.0.27, so I run the functional tests with that one. (Cool GHA!) Note that php 5.3 is supposed to completely fail with MySQL 8 - that's nothing to fix or even consider.

The transliteration test passes with version 8.0.27, but there's another test failing, which points to another problem with either the test or the functionality behind it: DatabaseTemporaryQueryTestCase

TBH, currently my motivation to keep on working on this PR is rather low. Even lower if I have to hunt some possible bug in an outdated MySQL version (8.0.22). The effort to install specific mysql versions is rather high for me and personally, I don't have any websites out there running on version 8.

If anyone wants to take over, please do!

I guess, DatabaseTemporaryQueryTestCase needs a new issue - that problem is unrelated here.

indigoxela avatar Jan 02 '22 14:01 indigoxela