FR: Use SHOW INDEX STATUS output to derive data distribution

Open markusalbe opened this issue 7 years ago • 0 comments

I believe we could use output of SHOW INDEX STATUS to find out how many unique values each column should present.

+------------------------+------------+--------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table                  | Non_unique | Key_name           | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------------+------------+--------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| performance_by_affcode |          0 | PRIMARY            |            1 | id              | A         |  2183073712 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          0 | affcode_per_day    |            1 | profile_id      | A         |        1113 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          0 | affcode_per_day    |            2 | ad_id_in_target | A         |     8838355 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          0 | affcode_per_day    |            3 | affcode         | A         |    33076874 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          0 | affcode_per_day    |            4 | date            | A         |  2183073712 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          1 | campaign_id_date   |            1 | campaign_id     | A         |         447 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          1 | campaign_id_date   |            2 | date            | A         |      326563 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          1 | last_updated       |            1 | last_updated    | A         |    13311425 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          1 | profile_ad_id_date |            1 | profile_id      | A         |         919 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          1 | profile_ad_id_date |            2 | ad_id_in_target | A         |    10754057 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          1 | profile_ad_id_date |            3 | date            | A         |  2183073712 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          1 | profile_affcode    |            1 | profile_id      | A         |         763 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          1 | profile_affcode    |            2 | affcode         | A         |    31638749 |     NULL | NULL   |      | BTREE      |         |               |
| performance_by_affcode |          1 | profile_id_date    |            1 | profile_id      | A         |         723 |     NULL | NULL   | YES  | BTREE      |         |               |
| performance_by_affcode |          1 | profile_id_date    |            2 | date            | A         |      265387 |     NULL | NULL   |      | BTREE      |         |               |
+------------------------+------------+--------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

The maximum cardinality value would be the total rows to generate, and you can nest-loop from largest to smallest. So if you had columns A, B and C with cardinality 2, 3 and 6 respectively you would end up with

A |  B  |  C
---+---+----
1  |  1   |  1  
2  |  2   |  2 
1  |  3   |  3 
2  |  1   |  4
1  |  2   |  5
2  |  3   |  6

[note for usage: if an index does not exists customer could provide SELECT COUNT(DISTINCT...) for the relevant columns and we could manually add lines to the output based on that to test on our side]

Jun 02 '18 22:06 markusalbe