tidb icon indicating copy to clipboard operation
tidb copied to clipboard

Generated columns index replacement rule not support JSON functions like MySQL

Open sunnyfun888 opened this issue 3 years ago • 8 comments

Bug Report

Please answer these questions before submitting your issue. Thanks!

In MySQL, use this way, optimizer will look for compatible virtual columns with indexes that match the expression in JSON queries:

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.26    |
+-----------+
1 row in set (0.00 sec)

mysql> CREATE TABLE person (
    -> id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -> NAME VARCHAR(255) NOT NULL,
    -> address_info JSON,
    -> city VARCHAR(64) AS (JSON_UNQUOTE(address_info->"$.city")),
    -> KEY (city)
    -> );
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO `person` (`id`, `NAME`, `address_info`) VALUES('1','a','{\"city\": \"Beijing\"}');
Query OK, 1 row affected (0.01 sec)

mysql> EXPLAIN ANALYZE SELECT NAME, id, city FROM person WHERE address_info->>"$.city" = 'Beijing';
+----------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on person using city (city='Beijing')  (cost=0.35 rows=1) (actual time=0.026..0.029 rows=1 loops=1)
|
+----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

But same SQL in TiDB, the query will be TableFullScan:

SELECT VERSION();
+--------------------+
| VERSION()          |
+--------------------+
| 5.7.25-TiDB-v6.1.0 |
+--------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE person (
    -> id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -> NAME VARCHAR(255) NOT NULL,
    -> address_info JSON,
    -> city VARCHAR(64) AS (JSON_UNQUOTE(address_info->"$.city")),
    -> KEY (city)
    -> );

EXPLAIN ANALYZE SELECT NAME, id, city FROM person WHERE address_info->>"$.city" = 'Beijing';
Query OK, 0 rows affected (0.08 sec)

mysql> INSERT INTO `person` (`id`, `NAME`, `address_info`) VALUES('1','a','{\"city\": \"Beijing\"}');
Query OK, 1 row affected (0.01 sec)

mysql> EXPLAIN ANALYZE SELECT NAME, id, city FROM person WHERE address_info->>"$.city" = 'Beijing';
+---------------------------+----------+---------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+-----------+------+
| id                        | estRows  | actRows | task      | access object | execution info                                                                                                                                                                                                                           | operator info                                                                                             | memory    | disk |
+---------------------------+----------+---------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+-----------+------+
| Projection_4              | 8000.00  | 1       | root      |               | time:640µs, loops:2, Concurrency:5                                                                                                                                                                                                      | test.person.name, test.person.id, test.person.city                                                        | 19.2 KB   | N/A  |
| └─TableReader_7       | 8000.00  | 1       | root      |               | time:586.1µs, loops:2, cop_task: {num: 1, max: 489.7µs, proc_keys: 1, rpc_num: 1, rpc_time: 472.4µs, copr_cache_hit_ratio: 0.00}                                                                                                      | data:Selection_6                                                                                          | 328 Bytes | N/A  |
|   └─Selection_6       | 8000.00  | 1       | cop[tikv] |               | tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 1, total_process_keys_size: 72, total_keys: 2, rocksdb: {delete_skipped_count: 1, key_skipped_count: 2, block: {cache_hit_count: 1, read_count: 0, read_byte: 0 Bytes}}} | eq(json_unquote(cast(json_extract(test.person.address_info, "$.city"), var_string(16777216))), "Beijing") | N/A       | N/A  |
|     └─TableFullScan_5 | 10000.00 | 1       | cop[tikv] | table:person  | tikv_task:{time:0s, loops:1}                                                                                                                                                                                                             | keep order:false, stats:pseudo                                                                            | N/A       | N/A  |
+---------------------------+----------+---------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+-----------+------+

It seems the Generated columns index replacement rule not work in this case.

2. What did you expect to see? (Required)

Generated columns index replacement rule supports JSON functions, like MySQL.

3. What did you see instead (Required)

Current version of TiDB not support.

4. What is your TiDB version? (Required)

6.1.0

sunnyfun888 avatar Jun 18 '22 09:06 sunnyfun888

Hi, @sunnyfun888 JSON_UNQUTE is a special function that hard to apply the optimization. You can work around with

mysql> alter table person add index idx((cast(json_extract(`address_info`, _utf8mb4'$.city') as char(64))));
Query OK, 0 rows affected (2.54 sec)

mysql> EXPLAIN ANALYZE SELECT NAME, id, city FROM person WHERE cast(json_extract(`address_info`, _utf8mb4'$.city') as char(64)) = 'Beijing';
+------------------------------------+---------+---------+-----------+-------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-----------+------+
| id                                 | estRows | actRows | task      | access object                                                                             | execution info                                                                                                                                                          | operator info                                                                                                            | memory    | disk |
+------------------------------------+---------+---------+-----------+-------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-----------+------+
| Projection_4                       | 10.00   | 0       | root      |                                                                                           | time:892.8µs, loops:1, Concurrency:OFF                                                                                                                                  | test.person.name, test.person.id, test.person.city                                                                       | 2.23 KB   | N/A  |
| └─Projection_12                    | 10.00   | 0       | root      |                                                                                           | time:889µs, loops:1, Concurrency:OFF                                                                                                                                    | test.person.id, test.person.name, test.person.city, cast(json_extract(test.person.address_info, $.city), var_string(64)) | 2.86 KB   | N/A  |
|   └─IndexLookUp_11                 | 10.00   | 0       | root      |                                                                                           | time:887.8µs, loops:1,                                                                                                                                                  |                                                                                                                          | 163 Bytes | N/A  |
|     ├─IndexRangeScan_9(Build)      | 10.00   | 0       | cop[tikv] | table:person, index:idx(cast(json_extract(`address_info`, _utf8mb4'$.city') as char(64))) | time:577.9µs, loops:1, cop_task: {num: 1, max: 376.3µs, proc_keys: 0, rpc_num: 1, rpc_time: 301.6µs, copr_cache_hit_ratio: 0.00}, tikv_task:{time:229.8µs, loops:0}     | range:["Beijing","Beijing"], keep order:false, stats:pseudo                                                              | N/A       | N/A  |
|     └─TableRowIDScan_10(Probe)     | 10.00   | 0       | cop[tikv] | table:person                                                                              |                                                                                                                                                                         | keep order:false, stats:pseudo                                                                                           | N/A       | N/A  |
+------------------------------------+---------+---------+-----------+-------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-----------+------+
5 rows in set (0.01 sec)

wjhuang2016 avatar Jun 21 '22 11:06 wjhuang2016

Hi @wjhuang2016, We're trying to migrate a legacy system from MySQL to TiDB. During the migration research, this issue was found. There're a variety of data types in the json(string, number, datetime), all was fine when use virtual columns with indexes in MySQL, if use this workaround, the application need to record every data types instead of DB, maybe the cost will be a little big.

sunnyfun888 avatar Jun 23 '22 05:06 sunnyfun888

Hi @wjhuang2016, We're trying to migrate a legacy system from MySQL to TiDB. During the migration research, this issue was found. There're a variety of data types in the json(string, number, datetime), all was fine when use virtual columns with indexes in MySQL, if use this workaround, the application need to record every data types instead of DB, maybe the cost will be a little big.

The problem is that the expression type and the generated column type need to be strictly equal. MySQL doesn't follow it, and it causes a bug in MySQL. For example, the result of JSON_UNQUOTE(address_info->"$.city") may be longer than 64, so it cannot be stored in city without truncation. If we replace it by city, the result may be wrong.

wjhuang2016 avatar Jun 23 '22 06:06 wjhuang2016

The problem is that the expression type and the generated column type need to be strictly equal. MySQL doesn't follow it, and it causes a bug in MySQL. For example, the result of JSON_UNQUOTE(address_info->"$.city") may be longer than 64, so it cannot be stored in city without truncation. If we replace it by city, the result may be wrong.

I think all other virtual columns use expressions will meet the same problem, infact in MySQL if the string is longer than 64 when insert or update, will get Data too long for column error, and the truncation is rollbacked. The behavior of MySQL is more like syntactic sugar.

sunnyfun888 avatar Jun 23 '22 07:06 sunnyfun888

The problem is that the expression type and the generated column type need to be strictly equal. MySQL doesn't follow it, and it causes a bug in MySQL. For example, the result of JSON_UNQUOTE(address_info->"$.city") may be longer than 64, so it cannot be stored in city without truncation. If we replace it by city, the result may be wrong.

I think all other virtual columns use expressions will meet the same problem, infact in MySQL if the string is longer than 64 when insert or update, will get Data too long for column error, and the truncation is rollbacked. The behavior of MySQL is more like syntactic sugar.

@sunnyfun888 For the strict sql-mode, it would report an error. But if in the empty sql-mode, it can insert successfully.

wjhuang2016 avatar Jun 23 '22 08:06 wjhuang2016

 if the string is longer than 64 when insert or update, will get Data too long for column error, and the truncation will be rollbacked

It's not true if the sql_mode is "".

set @@sql_mode="";

If users can accept the possible wrong result, it's ok the do this optimization. But when designed this feature, we tend to choose the correctness.

wjhuang2016 avatar Jun 24 '22 08:06 wjhuang2016

If users can accept the possible wrong result, it's ok the do this optimization. But when designed this feature, we tend to choose the correctness.

I agree and can use a switch then leave the choice to users.

sunnyfun888 avatar Jun 25 '22 05:06 sunnyfun888

it is a mysql bug, consider the following case

CREATE TABLE person (id INT PRIMARY KEY,address_info JSON,city VARCHAR(2) AS (JSON_UNQUOTE(address_info->"$.city")),KEY (city));
set @@sql_mode='';
INSERT INTO `person` (`id`, `address_info`) VALUES('1','{\"city\": \"Beijing\"}');  -- success with warning
set @@sql_mode=default;
SELECT id FROM person ignore index(`city`) WHERE address_info->>"$.city" = 'Beijing';  -- 1 record
SELECT id FROM person force index(`city`) WHERE address_info->>"$.city" = 'Beijing';  -- 0 record

use or not use an index will get a different result, it is absolutely a bug

xiongjiwei avatar Aug 11 '22 07:08 xiongjiwei

close due to MySQL has confirmed it is a bug https://bugs.mysql.com/bug.php?id=108114

xiongjiwei avatar Aug 19 '22 03:08 xiongjiwei